I thought I had nothing to say about Ofsted’s proposed use of book scrutiny because there is no research on book scrutiny (I wonder why?!?). As far as I am concerned, even if Ofsted’s own research could show inspectors can rate books with 100% consistency, it would still be a terrible idea. The effect of the framework on daily school practice matters far more than our ability to judge schools reliably, and ensuring books are ‘Ofsted ready’ at all times is so much worse than just having to keep your special ‘Ofsted lesson plan’ in your top drawer. At Teacher Tapp we’ve started keeping an eye on the book scrutiny monster. But since I had a couple of spare hours watching sports day today, I decided to read and reflect on the Ofsted book scrutiny research. Here’s my perspective on it.
Now, it is important to say up front that the research Ofsted carried out on book scrutiny has NOTHING to do with their latest ideas about the role of book scrutiny in inspection. Interpretation of the new framework is something of a moving target at the moment, but, as far as I understand it from Ofsted employees, book scrutiny will only be used to check curriculum compliance (i.e. you taught the things you said you would). They tell me it does not form part of the assessment of impact (it is a mystery how that is measured), but does contribute to the ‘quality of education’ judgement overall through something impenetrable called ‘triangulation’. Except that when you read the Inspection handbook it does say that work scrutiny can be used ‘…to form a view of whether pupils know more and can do more…’. Confused? I certainly am.
Book scrutiny is a highly non-standardised form of assessment
The Ofsted research was carried out to assess whether inspectors can rate reliably using book scrutiny indicators, of which two are closely related to assessing whether pupils know more and can do more. We know the circumstances under which assessment is reliable – when it is a standardised assessment, with known reliability, of a standardised curriculum, sat in standardised conditions, with a standardised appreciation of the stakes. Book scrutiny stands up poorly against these criteria:
- The rubric used in the research study relied on performance descriptors, something we’ve been trying to shake off in assessing learning elsewhere. The inspectors in the study admit that the language of ‘some’, ‘sufficient’, ‘considerable’ is too vague to rate against. This is inevitable if we try to construct generic rubrics to be used across multiple subjects, curricula and age groups. It can’t be fixed by ‘better rubrics’ – they don’t exist. (See Daisy Christodoulou’s blogs for the problems with performance descriptors.)
- We know that the reliability of book scrutiny is quite low for some of the indicators used in the research (which remember WON’T be used in inspection!!!). The report says they will try to resolve this through more subject documentation and training to help turn an English teacher into a science inspector. I look forward to learning more about that sorcery.
- Observing books across different curricula increases the complexity of the challenge – it will be hard to make comparative judgments on how students learn about cell structure versus learning about light waves, yet this is what non-specialists will need to do. Equally, comparisons for the same student learning about different parts of the knowledge domain (e.g. studying first cells and then light waves) are far more difficult than comparisons within a domain (e.g. development in sentence construction over time).
- The challenge in judging progress over time within an exercise book is considerable, as it is for measuring progress between two testing points. (It sounds like Ofsted don’t plan to do this, though it was part of the research. Clarity about book storage is urgently needed here.) Whether we like it or not, the progress of individual students is slow compared to the variability of achievement within the age cohort. And this is as true for progress in a book as it is for progress across two tests.
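To put a rough number on that last point, here is a toy simulation in Python. The figures are my invention, not Ofsted’s: I assume within-cohort attainment spreads with an SD of three ‘developmental years’ against one year of true progress between two points in a book.

```python
import random

random.seed(0)

# Invented but plausible figures: within a year group, attainment spreads
# with an SD of ~3 'developmental years', while a pupil gains ~1 year of
# true progress between two points in an exercise book.
COHORT_SD = 3.0
YEAR_OF_PROGRESS = 1.0

trials = 100_000
gap_exceeds_progress = 0
for _ in range(trials):
    a = random.gauss(0, COHORT_SD)  # one pupil's attainment
    b = random.gauss(0, COHORT_SD)  # a classmate's attainment
    if abs(a - b) > YEAR_OF_PROGRESS:
        gap_exceeds_progress += 1

share = gap_exceeds_progress / trials
print(f"classmate gaps exceeding a year's progress: {share:.0%}")
```

Under these assumptions, in roughly four pairs of classmates out of five the gap between the two pupils exceeds a full year’s progress – which is why spotting progress in a book is so much harder than spotting attainment.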
Book scrutiny is more like Year 6 writing moderation than it is the Year 4 multiplication check. It is more like a BTEC portfolio than it is a GCSE examination. In our high-stakes accountability system, of which Ofsted is a part, we know exactly what happens when we hold teachers to account for teacher-malleable, vague-rubric assessment approaches, whether they are used to measure ‘implementation’ or ‘impact’ or some other ‘i’ yet to be determined.
The more alike schools are, the greater the reliability challenge
I started messing about with simulation data on book scrutiny reliability to work out whether 0.5 reliability was acceptably high, but quickly stopped when I realised that we cannot say whether it is high enough to contribute to measuring school quality without knowing the variation in provision that we need inspectors to discriminate between. Our difficulty is that most variation in provision takes place within rather than across schools.
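Here is the kind of simulation I was messing about with – a sketch with invented numbers, in which the instrument’s measurement noise is held fixed and only the true spread between schools changes:

```python
import random

random.seed(1)

# Invented figures: the inspection instrument's measurement noise is fixed
# (error SD = 10 quality points) while the true spread between schools varies.
ERROR_SD = 10.0

def rank_agreement(between_school_sd, trials=100_000):
    """Share of school pairs that a noisy measure orders correctly."""
    agree = 0
    for _ in range(trials):
        a = random.gauss(0, between_school_sd)  # true quality, school A
        b = random.gauss(0, between_school_sd)  # true quality, school B
        obs_a = a + random.gauss(0, ERROR_SD)
        obs_b = b + random.gauss(0, ERROR_SD)
        if (a > b) == (obs_a > obs_b):
            agree += 1
    return agree / trials

print(f"similar schools (true SD = 3):  {rank_agreement(3):.2f}")
print(f"varied schools  (true SD = 30): {rank_agreement(30):.2f}")
```

The same instrument orders schools correctly about nine times in ten when they differ a lot, but little better than a coin toss when they are similar – which is why a reliability figure like 0.5 means nothing until you know the variation it has to discriminate between.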
I don’t envy Ofsted in needing to measure school quality at speed. School quality is a latent construct that we can only ever observe through the cloudy and slightly warped mirrors of test scores, book scrutiny, lesson observations and conversations. The success of education policy in ensuring under-performing schools are dealt with makes the challenge ever greater. Imagine you are asked to select the shoe box with the greatest number of sweets in it, where all you can do is peer through a series of tiny pin-holes. How well can you do at selecting the one with the most sweets? That rather depends on whether shoe boxes have 56, 59 and 62 sweets in or whether they have 10, 26 and 58 sweets in. Our challenge is that schools are much more like the former.
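The shoe box game is easy to simulate too. A sketch, assuming each pin-hole glimpse gives an unbiased count with an SD of 10 sweets (the noise figure is my invention):

```python
import random

random.seed(2)

# Assumed numbers: each pin-hole glimpse yields an unbiased estimate of a
# box's sweet count with an SD of 10 sweets.
GLIMPSE_SD = 10.0

def pick_correctly(counts, trials=100_000):
    """How often do noisy glimpses identify the truly fullest box?"""
    best = counts.index(max(counts))
    wins = 0
    for _ in range(trials):
        glimpses = [c + random.gauss(0, GLIMPSE_SD) for c in counts]
        if glimpses.index(max(glimpses)) == best:
            wins += 1
    return wins / trials

print(f"similar boxes [56, 59, 62]: {pick_correctly([56, 59, 62]):.2f}")
print(f"varied boxes  [10, 26, 58]: {pick_correctly([10, 26, 58]):.2f}")
```

With the similar boxes you pick the fullest only around half the time; with the varied boxes you almost never miss.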
Ofsted’s approach to making these nuanced judgements between mostly-not-so-dysfunctional schools involves aggregating or triangulating lots of pieces of pretty unreliable ‘evidence’ on school quality. Aggregation is the friend of reliability, but how straightforward triangulation turns out to be for Ofsted rather depends on whether each of these ‘borderline moderately reliable’ measures (e.g. book scrutiny for a subject; a class observation; a conversation with the head or a teacher) consistently measures the same underlying construct. To the extent that there is a brilliant teacher or subject department within a less brilliant school (which we know there usually is), the ‘deep dive’ can tell us a great deal about the phenomenon we observe – the class, the teacher, the department – but less about the school overall. Or to put it another way, if inspectors conduct a deep dive into French, maths and history, will our judgements have any predictive validity in telling us what science, geography and art are likely to be like? If the answer is no, then the idea of a single school judgement of quality, collated along these lines, is fundamentally unsound.
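‘Aggregation is the friend of reliability’ can be made precise with the classic Spearman-Brown prophecy formula – but note the assumption doing all the work in the comment below:

```python
# Spearman-Brown prophecy formula: aggregating k comparable measures of the
# SAME construct, each with reliability r, yields reliability kr / (1 + (k-1)r).
# The catch is the capitalised word: the formula assumes every piece of
# evidence taps one underlying construct, not k different departments.
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

for k in (1, 3, 6):
    print(f"{k} piece(s) of evidence at r = 0.5 -> {spearman_brown(0.5, k):.2f}")
```

Six independent pieces of evidence at reliability 0.5 prophesy an aggregate reliability of about 0.86 – but only if every piece measures the same thing, which is precisely what a deep dive into three subjects cannot guarantee.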
Pedagogical styles determine the nature of work in books
Ofsted inspectors aren’t the only people to scrutinise students’ books; parents do. On a school visit to observe the use of a maths programme with mixed ability year 7 classes, the Head told me there had been complaints from parents about the maths work in the books of one particular class. On arrival in the classroom, it was easy to see why the books were a bit sparse and scruffy. The teacher had a pedagogical style that was highly energetic, with chanting and quick-fire mini whiteboard exercises. By contrast, when we walked into the parallel year 7 maths classes, students were completing exactly the same curriculum programme using the more traditional approach of copying down a worked example and completing an exercise set. Now it’s not for me to say which lessons I preferred; I simply tell it as a story of why what is recorded in books reflects not just the curriculum and its implementation, but also the pedagogical choices we make.
I fear that the requirement for books to show evidence of curriculum progression will curtail pedagogic choices that teachers can currently make. I have no issue with us having a national conversation about whether the inspectorate should have preferred pedagogic styles. My issue is having an inspectorate who declare you can teach how you like, provided they can see curriculum progression in exercise books.
We urgently need to know how reliable inspection is
It is great to see Ofsted doing some research into their inspection process. However, inspector inter-rater reliability – even if it were high (which it isn’t) – isn’t enough to tell us whether our inspection system is fit-for-purpose. We need a research project that compares current inspection ratings against ‘true’ measures of school quality. The difficulty is, who knows the truth?!?
This is what I’d like to do: select 20 schools that all have reasonably similar demographic profiles and for each one carry out a month-long inspection with at least 10 subject specialists on site. During that month, inspectors spend time with leadership and in their subject departments observing lessons, talking to teachers, testing students, and so on, to work towards an overall judgement of school quality. Now, I am not suggesting that this would reveal true school quality, but it would be a fuller picture of school provision than the status quo. Just before this month-long inspection begins, we send in a different set of inspectors to carry out a standard Ofsted inspection, without reporting the judgement. At the end of the project, we would be able to compare the ranking of the schools under the conventional and the extended inspection. If the rankings of the 20 schools were considerably different, this would suggest that our inspection system is not fit-for-purpose.
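The comparison at the end could be done with something as simple as Spearman’s rank correlation between the two rankings of the 20 schools. A sketch, where the ‘deep’ ranking is invented purely for illustration:

```python
# Spearman's rank correlation between two rankings of the same 20 schools.
# The 'deep' ranking below is invented purely for illustration.
def spearman_rho(rank_a, rank_b):
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

quick = list(range(1, 21))                # ranking from the standard inspection
deep = [3, 1, 5, 2, 8, 4, 11, 6, 7, 14,   # invented month-long ranking
        9, 17, 10, 12, 20, 13, 15, 19, 16, 18]

print(f"rho = {spearman_rho(quick, deep):.2f}")
```

A rho near 1 would suggest the quick inspection orders schools much as the month-long one does; a low rho would be evidence that the current system is not fit-for-purpose.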