The book scrutiny monster

I thought I had nothing to say about Ofsted’s proposed use of book scrutiny because there is no research on book scrutiny (I wonder why?!?). As far as I am concerned, even if Ofsted’s own research could show inspectors can rate books with 100% consistency, it would still be a terrible idea. The effect of the framework on daily school practice matters far more than our ability to judge schools reliably, and ensuring books are ‘Ofsted ready’ at all times is so much worse than just having to keep your special ‘Ofsted lesson plan’ in your top drawer. At Teacher Tapp we’ve started keeping an eye on the book scrutiny monster. But since I had a couple of spare hours watching sports day today, I decided to read and reflect on the Ofsted book scrutiny research. Here’s my perspective on it.

Now, it is important to say up front that the research Ofsted carried out on book scrutiny has NOTHING to do with their latest ideas about the role of book scrutiny in inspection. Interpretation of the new framework is something of a moving target at the moment, but, as far as I understand it from Ofsted employees, book scrutiny will only be used to ensure curriculum compliance (i.e. you taught the things you said you would). They tell me it does not form part of the assessment of impact (it is a mystery how that is measured), but does contribute to the ‘quality of education’ judgment overall through something impenetrable called ‘triangulation’. Except when you read the Inspection handbook it does say that work scrutiny can be used ‘…to form a view of whether pupils know more and can do more…’ Confused? I certainly am.

Book scrutiny is a highly non-standardised form of assessment

The Ofsted research was carried out to assess whether inspectors can rate reliably using book scrutiny indicators, of which two are closely related to assessing whether pupils know more and can do more. We know the circumstances under which assessment is reliable – when it is a standardised assessment with known reliability, of a standardised curriculum, sat in standardised assessment conditions, with a standardised appreciation of the stakes. Book scrutiny stands up poorly to these:

  • The rubric used in the research study relied on performance descriptors, something we’ve been trying to shake off in assessing learning elsewhere. The inspectors in the study admit that the language of ‘some’, ‘sufficient’, ‘considerable’ is too vague to rate against. This is inevitable if we try to construct generic rubrics to be used across multiple subjects, curricula and age groups. It can’t be fixed by ‘better rubrics’ – they don’t exist. (See Daisy Christodoulou’s blogs for the problems with performance descriptors.)
  • We know that the reliability of book scrutiny is quite low for some of the indicators used in the research (which remember WON’T be used in inspection!!!). The report says they will try to resolve this through more subject documentation and training to help turn an English teacher into a science inspector. I look forward to learning more about that sorcery.
  • Observing books across different curricula increases the complexity of the challenge – it will be hard to make comparative judgments on how students learn about cell structure versus learning about light waves, yet this is what non-specialists will need to do. Equally, comparisons for the same student learning about different parts of the knowledge domain (e.g. studying first cells and then light waves) are far more difficult than comparisons within a domain (e.g. development in sentence construction over time).
  • The challenge in judging progress over time within an exercise book is considerable, as it is for measuring progress between two testing points. (It sounds like Ofsted don’t plan to do this, though it was part of the research. Clarity about book storage is urgently needed here.) Whether we like it or not, the progress of individual students is slow compared to the variability of achievement within the age cohort. And this is as true for progress in a book as it is for progress across two tests.

Book scrutiny is more like Year 6 writing moderation than it is the Year 4 multiplication tables check. It is more like a BTEC portfolio than it is a GCSE examination. In our high-stakes accountability system, of which Ofsted is part, we know exactly what happens when we hold teachers to account for teacher-malleable, vague-rubric assessment approaches, whether they are to measure ‘implementation’ or to measure ‘impact’ or some other ‘i’ yet to be determined.

The more alike schools are, the greater the reliability challenge

I started messing about with simulation data on book scrutiny reliability to work out whether 0.5 reliability was acceptably high, but quickly stopped when I realised that we cannot say whether it is high enough to contribute to measuring school quality without knowing the variation in provision that we need inspectors to discriminate between. Our difficulty is that most variation in provision takes place within rather than across schools.

I don’t envy Ofsted in needing to measure school quality at speed. School quality is a latent construct that we can only ever observe through the cloudy and slightly warped mirrors of test scores, book scrutiny, lesson observations and conversations. The success of education policy in ensuring under-performing schools are dealt with makes the challenge ever greater. Imagine you are asked to select the shoe box with the greatest number of sweets in it, where all you can do is peer through a series of tiny pin-holes. How well can you do at selecting the one with the most sweets? That rather depends on whether shoe boxes have 56, 59 and 62 sweets in or whether they have 10, 26 and 58 sweets in. Our challenge is that schools are much more like the former.
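The shoe-box intuition is easy to simulate. Below is a minimal sketch where the pin-hole is modelled as Gaussian noise added to each box’s true count (the noise level of 10 sweets is an arbitrary assumption of mine, chosen only to illustrate the contrast):

```python
import random

def pick_best_box(true_counts, noise_sd, n_trials=10_000, seed=1):
    """Estimate how often a noisy observer picks the fullest box.

    Each trial, the observer sees every box's true count plus Gaussian
    noise (the 'pin-hole') and picks whichever box looks fullest.
    Returns the proportion of trials where they picked correctly.
    """
    rng = random.Random(seed)
    best = max(range(len(true_counts)), key=lambda i: true_counts[i])
    correct = 0
    for _ in range(n_trials):
        observed = [c + rng.gauss(0, noise_sd) for c in true_counts]
        if max(range(len(observed)), key=lambda i: observed[i]) == best:
            correct += 1
    return correct / n_trials

similar = pick_best_box([56, 59, 62], noise_sd=10)  # schools much alike
spread = pick_best_box([10, 26, 58], noise_sd=10)   # schools very different
print(f"similar boxes: {similar:.0%}, spread-out boxes: {spread:.0%}")
```

With the same observational noise, the spread-out boxes are ranked almost perfectly while the similar boxes are little better than a coin toss – which is exactly the inspector’s predicament when most schools are not so different from one another.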

Ofsted’s approach to making these nuanced judgments between mostly-not-so-dysfunctional schools involves aggregating or triangulating lots of pieces of pretty unreliable ‘evidence’ on school quality. Aggregation is the friend of reliability but how straightforward triangulation turns out to be for Ofsted rather depends on whether each of these ‘borderline moderately reliable’ measures (e.g. book scrutiny for a subject; a class observation; a conversation with the head or teacher) consistently measures the same underlying construct. To the extent that there is a brilliant teacher or subject department within a less brilliant school (which we know there usually is), the ‘deep dive’ can tell us a great deal about the phenomenon we observe – the class, the teacher, the department – but less about the school overall. Or to put it another way, if inspectors conduct a deep dive into French, maths and history, will our judgments have any predictive validity in telling us what science, geography and art are likely to be like? If the answer is no, then the idea of a single school judgement of quality, collated along these lines, is fundamentally unsound.
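The sense in which aggregation is the friend of reliability can be made concrete with the standard Spearman–Brown prophecy formula, which gives the reliability of an average of k parallel measures that each have reliability r. The crucial caveat is in the word ‘parallel’: the formula only holds if the measures tap the same underlying construct, which is precisely what the deep-dive structure puts in doubt. A quick sketch:

```python
def spearman_brown(r, k):
    """Reliability of the mean of k parallel measures, each with
    reliability r: k*r / (1 + (k-1)*r). Assumes all measures tap
    the SAME construct - the formula says nothing if they don't.
    """
    return k * r / (1 + (k - 1) * r)

# Starting from the roughly 0.5 reliability seen for some indicators:
for k in (1, 2, 4, 8):
    print(f"{k} measures -> reliability {spearman_brown(0.5, k):.2f}")
```

So eight genuinely parallel pieces of 0.5-reliability evidence would get you to around 0.89 – but pieces of evidence about different teachers, departments and subjects are not parallel measures of ‘the school’, and no amount of averaging fixes that.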

Pedagogical styles determine the nature of work in books

Ofsted inspectors aren’t the only people to scrutinise students’ books; parents do. On a school visit to observe the use of a maths programme with mixed-ability Year 7 classes, the Head told me there had been complaints from parents about the maths work in the books of one particular class. On arrival in the classroom, it was easy to see why the books were a bit sparse and scruffy. The teacher had a pedagogical style that was highly energetic, with chanting and quick-fire mini whiteboard exercises. By contrast, when we walked into the parallel Year 7 maths classes they were completing exactly the same curriculum programme using the more traditional approach of copying down a worked example and completing an exercise set. Now it’s not for me to say which lessons I preferred; I simply tell it as a story of why what is recorded in books reflects, not just the curriculum and its implementation, but the pedagogical choices we make too.

I fear that the requirement for books to show evidence of curriculum progression will curtail pedagogic choices that teachers can currently make. I have no issue with us having a national conversation about whether the inspectorate should have preferred pedagogic styles. My issue is having an inspectorate who declare you can teach how you like, provided they can see curriculum progression in exercise books.

We urgently need to know how reliable inspection is

It is great to see Ofsted doing some research into their inspection process. However, inspector inter-rater reliability – even if it were high (which it isn’t) – isn’t enough to tell us whether our inspection system is fit-for-purpose. We need a research project that compares current inspection ratings against ‘true’ measures of school quality. The difficulty is, who knows the truth?!?

This is what I’d like to do: select 20 schools that all have reasonably similar demographic profiles and for each one carry out a month-long inspection with at least 10 subject specialists on site. During that month, inspectors spend time with leadership and in their subject departments observing lessons, talking to teachers, testing students, and so on, to work towards an overall judgement of school quality. Now, I am not suggesting that this would be true school quality, but it would be a fuller picture of school provision than the status quo. Just before this month-long inspection begins, we send in a different set of inspectors to carry out a standard Ofsted inspection, without reporting the judgment. At the end of the project, we would be able to compare the ranking of the schools under the conventional and the extended inspection. If the rankings of the 20 schools were considerably different, this would suggest that our inspection system is not fit-for-purpose.

Any volunteers?

If you are still keen to read more about book scrutiny, try Tom Richmond in Schools Week and Mark Enser in TES.

Don’t let ‘perfect’ become the enemy of ‘better’ in the revision of accountability metrics

Last week, Ed Dorrell wrote a strange editorial in TES called ‘Why attaching excluded pupils’ results to their school won’t work’. I say it was strange because he failed to address the major impediment to including off-rolled pupils in accountability metrics (i.e. finding them… for that, read on). There is no doubt that there are some complicated choices, trade-offs to consider, and new sets of undesirable behaviours that would arise from making schools accountable for the pupils they teach. However, we are not starting from a neutral position and everyone agrees that the current accountability measures reward schools who find a way to lose students from their roll AND that this is an increasing problem.

We have to act, and do so with the following principle in mind. When we construct accountability metrics, our primary goal is that they encourage schools to behave as we want them to – in their admissions, expulsions, and nature of education provision. Ensuring the accountability metric fairly represents school quality should be second order, as tough as that feels to heads. (Why? I’ll write about it another time, but essentially the mechanisms by which having greater precision on estimates of school quality feed through to improved educational standards are pretty blunt.)

The question of how we should count a student when they leave the school roll or arrive part-way through school should be viewed through the lens of the school behaviours we’d like to invoke. We want the school to maximise the educational success of the community, including that student. We want them to remove students if they are disrupting others (and thus lowering likely GCSE scores). If schools do remove them, we want them to take an interest in ensuring that student then transfers to another school or alternative provision, rather than encouraging them to be ‘home schooled’. If the student is not disrupting the school community, and is more likely to be successful at their school than elsewhere, then barring specific circumstances (e.g. breaking the law or serious school rules), we want them to retain the student.

Ed poses a set of questions that suggest it is ‘unfeasibly complicated and impossible’ to explain and police a measure of Progress 8 that weights student results according to which of the 15 terms of secondary education they had spent in their secondary school (or equivalent in a middle school system). He is correct in the sense that there are literally an infinite number of choices we have to make – as there were when we made decisions about how to create the current Progress 8, incidentally. But choice from an infinite choice set is not ‘unfeasibly complicated and impossible’. All we need to do is pick an algorithm – any algorithm – that produces BETTER behaviours than we currently have. Again, whether or not it represents precisely how ‘good’ the school is isn’t our primary consideration.

Here are four alternative choices:

  1. Status quo = each student present in Spring y11 (term 14 out of 15) is weighted with a value of 1. The behavioural response is well known – there is a strong incentive to find a way to remove from the school roll any student who is likely to have a strongly negative progress score.
  2. Year 7 base option = schools are held accountable for the results of those who were admitted, regardless of whether they complete their education there. The advantage is that this produces a strong incentive to maximise the exam outcomes of each student who is enrolled, whether on roll or off roll. The disadvantage is that students admitted after year 7 will not count and so there is no need to maximise their GCSE outcomes. That said, it would encourage schools to feel comfortable in accepting previously excluded students from other schools as a fresh start, knowing that they will not be penalised in their performance table metrics.
  3. FFT Education Datalab proposal = each student is weighted according to the number of terms they spent at the school. This means schools would need to consider the best interests of every student that passes through the school, whether they are still on-roll or not. However, it does create an incentive to accelerate moving off-roll any student that is struggling. Would this produce large numbers of students being moved off roll during years 7 and 8 in a manner that is worse than current practice? This is a judgement call.
  4. Ever seen option = every student who appears at any stage in the school is weighted with a value of 1 (this is the unweighted version of the FFT Education Datalab proposal). This fixes the problem with the weighted method whereby off-rolling early is better than off-rolling late. However, it doesn’t fix the current incentive to avoid taking on previously excluded students from other schools to give them a fresh start.

All other options (e.g. ever seen since year 9; weighting KS4 terms more than KS3 terms; etc…) can be viewed as a minor deviation from the above in terms of the types of behaviour they induce.

These adjustments to a progress score DO NOT ‘disproportionately punish the majority of schools – those who strive to get the best out of even the most challenging students and for whom exclusion is a last resort’. Quite the opposite; they are less punishing to those schools who take on previously excluded students to give them a second chance in mainstream education.

As FFT Education Datalab showed, the vast majority of schools who have pupils come and go as normal should not worry since any modification to Progress 8 will not materially affect them. It is only high-mobility schools where there are substantial differences between the types of students who leave the school and the types of students who arrive at the school that are likely to be affected.

None of this is difficult to implement and progress measure tweaks are entirely independent of issues around the commissioning of alternative provision. The termly pupil census means we can straightforwardly calculate progress figures and match pupils to their examinations data.

That said, Ed’s piece fails to identify the one major impediment to holding schools accountable for pupils they taught, even after they left. There are two situations where we do not want to hold them accountable for a student who receives no GCSE examination results: (1) where that student left the country; and (2) where the student died or was unable to complete their education for serious medical reasons. When students disappear from all school and examination records, central government does not know the reason why because we have no census which covers children outside of education. School accountability isn’t a good enough reason to set up a full annual census of children, using GP records as a starting point. But, given the rise in ‘home schooling’ where parents are not even present to educate the teenager, there are very clear safeguarding reasons why it is time to look again at introducing one.

In the meantime, I don’t see concerns about death and migration as so material that we should continue with the damaging incentives set up by Progress 8 which currently allows substantial off-rolling without consequence.

Remember: there isn’t a world in which accountability is perfect, but there are many accountability measures that are better than the status quo.

Writing the rules of the grading game (part III): There is no value-neutral approach to giving feedback

These three blogs (part I, part II, part III here) are based on a talk I gave at Headteachers’ Roundtable Summit in March 2019. My thoughts on this topic have been extensively shaped by conversations with Ben White, a psychology teacher in Kent. Neither of us yet know what we think!

Our beliefs about our academic ability are often so tightly intertwined with our sense of self that we must take care in how we talk about it. It is only with a clear mental model of how feedback might alter a parent or student’s beliefs and goals that we can understand how to manage risks and enhance potential gains involved in communicating attainment. Inducing competitive behaviour can be enormously helpful in encouraging student effort in these situations where many of the benefits of learning are long-term and poorly appreciated by the learner. People tend to be highly motivated by facing up to social comparisons, and students will make these comparisons whether you give them ranking information or not. However, the mental model I’ve described suggests that pushing the competitive focus too far risks lowering effort or pushing students off the game, especially where they feel there is no prospect of doing well.

Risks in giving clear, cohort-referenced feedback

The model implies there are three situations where giving parents or students clear cohort or nationally-referenced feedback can lower future effort. Firstly, if they receive unexpectedly positive feedback, it could lead to complacency about effort required in the future. In fact, the simple act of learning your place on the curve with greater certainty than you had before could even be unhelpful if it gives you greater comfort that you are where you want to be. Secondly, there is a risk of demoralisation if you come to believe that the effort needed for a small improvement in ranking isn’t worth it. Thirdly, game-switching is a risk if you decide you can achieve better returns by working at something else. I think all these risks, but particularly the third, are deeply culturally situated and framed by how you communicate the value of achievement and hard work. Only you can know whether you have created a school climate where you can keep all your students playing the grading game you create for them.

Risks in giving kind and fuzzy feedback

So what are the risks around the alternative – the fuzzy language we use to talk about attainment? Teachers often suffer from rater leniency (being a little too generous with student grading) when there is room for subjectivity. The mental model shows why ‘kind’ feedback might – or might not – be so helpful. On the one hand, the model highlights why we might want to be lenient. By saying “you’ve learnt a lot”, we are hoping students feel more confident in their ability to learn in the future. This is unambiguously a good thing. Also, by saying “you’ve learnt a lot”, we are hoping to keep vulnerable students engaged in our grading game. However, leniency in the rating of a skill level can equally reduce motivation as it may signal the student has already done enough to get to the position they’d like to be in. Hence, while raising confidence in the ability to acquire a certain skill or achieve an outcome can be beneficial, raising confidence in the skill itself or the level of past achievements can be detrimental.

Withholding clear attainment information from parents and students can also be damaging if it de-prioritises YOUR game in their minds and fails to give them the information they need to ensure they maintain the position on the bell curve that they would like to achieve. Primary schools are the masters of ‘fuzzy’ feedback. I suspect a majority of primary parents are told their child is ‘as expected’ in most schools, yet these parents would respond quite differently to discovering their child was ranked 7/30 versus 23/30 in class. What mental model of the beliefs, capabilities and desires of that child and their parents leads primary schools to believe it is in the family’s interest to withhold clear attainment information? There is nice evidence from elsewhere in the world that shows powerful effects of communicating frequent and transparent attainment grades with parents of younger children. It would be great to have a trial in this country to learn which of our families respond to this information.

Downplaying the importance of prior attainment in the game

One unambiguous finding from this literature is the importance of trying to maintain strong student beliefs in their ability to climb up the rankings through making an effort. The teacher’s dilemma is how to maintain strong beliefs that learning is productive (i.e. you can do this) without telling students they’ve already done well enough (i.e. you’re already there). One implication is that you need to construct a game where feedback scores are truly responsive to changes in effort.

The problem we face is that performance in a test is frequently more strongly determined by prior knowledge/IQ than by recent student effort. Students are frequently rational in appreciating that effort yields few rewards. One approach to lessening the anchoring effect of prior knowledge is to encourage comparisons between students with similar prior attainment. For example, if a school has subject ability-setting, then within-class comparisons promote a competition where effort is more strongly rewarded than do whole-school comparisons. This approach will only be effective though if students ‘buy into’ the within-class games you’ve constructed, of course.

I am frequently asked why we need to make comparisons with other students at all. Asking a student to compete with their own past performance avoids many of the problems I have discussed, though we tend to be less motivated by competition with ourselves! Ipsative feedback compares performance in the same assessment of the same domain over time and we frequently use it outside school settings (e.g. my 5km speed this week compared to last). It would be great to see these ipsative comparisons encouraged more in schools, but there are good reasons why their application is limited. When we teach we tend to continuously expand the knowledge domain we wish to assess, making ipsative comparisons less straightforward (except in restricted domains such as times tables). (And since everyone in the class must plough on with learning the curriculum at the same speed, regardless of whether they are ready, mastery assessment approaches where we accumulate a list of competencies as they are reached also aren’t very practical). I think creating strange ‘progress’ measures, fudging within-student comparisons between non-standardised tests from one term to the next, are attempts to encourage students to make these comparisons with themselves. They can certainly be justified by the mental model described in the previous post. (For what it’s worth, though, I am not really convinced they are credible metrics in a game that students actually care about.)

A meaningful game where there are more winners than losers over time

If you want to construct a game where making an effort typically yields a decent return, why on earth would you make it a zero-sum game – as ranking does – where half your class will necessarily be losers despite making an effort? One initial step to avoid this is to create benchmarks that are external to the school, i.e. national reference points (invented, if necessary), to remove the requirement for there to be losers.

National-benchmarking, such as using standardised scores, still doesn’t ensure effort is typically rewarded though, unless your school happens to be able to outpace the national benchmark. To do this, your system for describing attainment could invent typical grades or levels that rise a little each year as students move through the school, generating a sense that pupils are getting better at your game. And so, our mental model starts to explain why schools invent and re-invent arbitrary levels systems!

But our mental model also asserts that we need our game to feel meaningful to students, with rewards that they value (otherwise they won’t feel inclined to work hard at it). Achieving a ‘Level Triangle’ (or whatever you choose to call it) might not feel meaningful enough to some. Is this how you settle on the idea of using GCSE grades as your levels system, since we know they are a grade which has motivational currency to students? Why not invent fictional internal scales and tell students they are at a GCSE grade 2 in Year 7, grade 3 in Year 8, and so on? Of course, technically this is a nonsense – a 12 year old who hasn’t studied the GCSE specification cannot possibly be a GCSE grade anything.

We find ourselves creating meaningless games, but ones that might be worthwhile because they have better motivational properties than any other game we could invent for our students to play! I hate meaningless data, but I’d find it hard to argue that schools shouldn’t use it if they could demonstrate it increased student effort.

The curious flightpath games we play

This gives us a new perspective on trying to make sense of the flightpath, a 5-year board game where individual students are asked to keep on, or beat, the path we’ve set up for them. It is easy to be dismissive of this game on grounds of its validity, especially when we know how little longitudinal attainment data conforms to the paths we create. But we should also ask whether it is more or less motivating than the termly ranking game or any other grading game we could give them as an alternative.

I’m pretty sure the standard, fixed flightpath that maps student attainment from Key Stage Two data, and is impervious to effort or new information, has poor motivational properties. The motivational properties are poor for those students who discover they are on track for a Grade 3, before they’ve had a chance to work hard in secondary school. They might also be poor for those who are told that, in all likelihood, they’ll attain a Grade 8 or 9. The game card prioritises the signal of prior attainment (bad for motivation) and underplays the importance of effort in reaching any desired goal.

But what about the schools who use dynamic flightpaths, updating each student’s game card each term or year in light of new effort and attainment information? Suppose that, at all times, the game card also shows the non-zero probability of any grade being attained in the future to signal the importance of effort in achieving goals. Is it possible that this type of dynamic grading can help students create a game it is possible to do well at, preserving useful beliefs about effort being productive, whilst also signalling that more effort is needed to get to the next position they’d like to attain?

This is all speculation – there isn’t any research out there that can tell you the impact of using target grades, predictions or flightpaths in different types of schools. All we can do is to invoke mental models to think through how they might affect motivation (one nice speculation about target grades is by James Theo).

The ethics of telling un-truths

When we construct grading games that prioritise manipulating behavioural responses over the whole-truth about attainment, we have to face up to tricky ethical dilemmas. We face these all the time in schools when we tell the half-truths we do to parents and students about attainment; the exploration of mental models simply makes it more explicit why we do it, and who we might help or damage in the process.

Mental models also make it explicit that one grading system will not suit all students equally well. Slightly over-confident, competitively minded students who are able to figure out how to translate effort into learning would do well in a pure rankings system. They will have classmates who find competition stressful and, even with considerable effort, risk slipping behind each year for reasons entirely outside their control. Those researchers who showed that cohort-referenced grades can improve school exam results also showed they increased inequality in happiness amongst students overall. If there are trade-offs, whose welfare do we prioritise?


Choosing how to give attainment feedback to students and their parents is a minefield, but I hope by now you appreciate that choosing NOT to give clear, interpretable (i.e. often norm-referenced) feedback on how a student is doing is not a neutral position to take. It can be damaging to the motivation of certain students under certain circumstances, and you need a clear mental framework to understand why this happens.

Equally, validity of inference should not be the only concern in working out how you are going to report attainment at school. Systems that look bizarre on the face of it, such as flightpaths, might have an intelligible approach to motivating and managing students’ complex belief systems.

If, on getting to the end of these posts, you feel utterly confused about what it is right to do, I think that’s OK. We can be pretty sure that choosing your grading system isn’t the most important decision a leadership team makes. It is true that many of these studies identify a costless and significantly positive effect of giving attainment feedback, particularly at points in time where the stakes are high or where attainment is not yet well known. However, the overall impact of a change in attainment reporting on end-of-school outcomes will typically be quite small, on average.

Nobody can tell you how you should construct your own grading game. The findings of the literature are inconsistent because the mental models of how we are trying to change student beliefs are very complex. How your students will respond to your grading system through your manipulation of their belief systems strongly depends on your school culture and on localised social norms amongst peers. The best you can do is take the time to learn what students believe to be true about themselves – both in their current attainment and in their capability to learn and progress. It is these existing beliefs that students hold about themselves that give you a clue as to how they might respond to your grading game.

Good luck with writing the rules of your grading game (it’s not easy)!

Writing the rules of the grading game (part II): The games children play

These three blogs (part I, part II here, part III) are based on a talk I gave at Headteachers’ Roundtable Summit in March 2019. My thoughts on this topic have been extensively shaped by conversations with Ben White, a psychology teacher in Kent. Neither of us yet know what we think!

The two fundamental jobs that children need to do are to feel successful and to have friends – every day. Sure, they could hire school to get these jobs done. Some achieve success and friends in the classroom, the band, the math club, or the basketball team. But to feel successful and have friends, they could also drop out of school and join a gang, or buy a car and cruise the streets. Viewed from the perspective of jobs, it becomes very clear that schools don’t often do these jobs well at all – in fact, all too often, schools are structured to help most students feel like failures.

Clayton M. Christensen, James Allworth and Karen Dillon
How Will You Measure Your Life?

Where do we believe we are on the grading curve?

Whether we like it or not, from a very young age students start to develop an idea of where they sit on a bell curve of attainment, relative to their peers. All that cohort-referenced feedback does is give students new information about how well they are doing in the ‘game’ of trying to climb up the bell curve. If only it were so simple that teachers could encourage students to make an effort in this ‘game’ by handing out regular cohort-referenced feedback! Though the examples in my first post showed that it’s a good bet that introducing cohort-referencing will raise effort, on average, other studies show it is a risky strategy because it can alter beliefs in quite unhelpful ways.

How can this happen? Surely everyone wants to work hard to get as far up the ladder as they can? This blog post sets out the mental model that explains the three dimensions of each student’s belief system that you must consider when thinking through how they will respond to learning their attainment.

Beliefs about where we sit on the curve

There is pretty convincing evidence that, rather than constantly striving to be the best, we instead tend to prioritise self-esteem maintenance: keeping ourselves at the place in the hierarchy to which we have become accustomed! (Remember we are having to juggle multiple games in life, so choices about where to direct effort are sometimes necessary.) This means that the effect of attainment feedback rather depends on how it allows us to update our prior ideas about our position on the bell curve (or at least lessen the fuzziness we have about how well we are doing).

Where a mark or grade received simply confirms our prior view of how well we were doing, the act of receiving the feedback might have no impact because we have no need to adjust behaviours to close the gap between our current performance and the internal standard we had in mind for ourselves.

This was demonstrated in a nice experiment, albeit one on undergraduates, which provided students with their position in the grade distribution every six months. The researchers found that giving this grading feedback actually LOWERED academic outcomes more frequently than it raised them! How can this be? Well, it turns out that, in the absence of knowing their rank, many of these students had actually underestimated how well they were doing. When they received the good news they were doing better than they thought, their (self-reported) satisfaction increased and they ramped down the effort they were putting into their studies. There was a smaller group for whom the reverse was true – they had overestimated their position in the year and so receiving the bad news of their true ranking caused them to increase their effort.

England’s long experiment with giving Year 12 students additional nationally-benchmarked feedback in the form of AS exam results was mirrored by a nicely-evaluated Greek policy experiment that introduced a nationally-benchmarked penultimate year exam. The strongest predictor of your response to this type of additional penultimate year feedback was whether it gave you a positive or negative surprise.

Adapted from Goulas and Megalokonomou (2015):
  • Negative surprise: learning that your 11th grade performance was worse than that of other students with similar 10th grade scores generally led to greater effort in final year exams, regardless of your prior attainment.
  • Positive surprise: learning you did better in 11th grade than your 10th grade position would suggest led to less effort in final year exams.

This finding that prior over- or under-confidence in attainment is central to how we respond to feedback is an incredibly important consideration when trying to second-guess how your students might respond to the feedback you give them. If they find out they are doing worse than they thought, they tend to pull their finger out! If they find out they are doing better than they thought, they tend to ease off the gas! It is prior beliefs about attainment, relative to actual attainment, that frequently predict a student’s response (and this finding is replicated in many studies).
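The surprise effect described above can be sketched as a toy model. This is entirely my own illustration, not a model taken from any of the studies cited: it simply encodes the idea that the direction of effort change follows the gap between where a student believed they sat in the distribution and where the feedback says they actually sit. The function name and the sensitivity parameter `k` are hypothetical.

```python
def effort_change(believed_rank, actual_rank, k=0.5):
    """Toy model of the surprise effect (my own illustration).

    Ranks are percentiles in [0, 1], higher is better.
    A positive surprise (doing better than you thought) tends to
    reduce effort; a negative surprise tends to increase it.
    `k` is a made-up sensitivity parameter, not an estimate.
    """
    surprise = actual_rank - believed_rank
    return -k * surprise

# Under-confident student: thought 40th percentile, told 70th -> eases off
assert effort_change(0.40, 0.70) < 0
# Over-confident student: thought 70th percentile, told 40th -> works harder
assert effort_change(0.70, 0.40) > 0
# Accurate prior: feedback merely confirms belief, so no change
assert effort_change(0.50, 0.50) == 0
```

A linear response is, of course, far too simple: the studies discussed here and below suggest the response is asymmetric and depends on where you sit in the distribution, on self-efficacy, and on social context.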

Now, it’s worth saying that, although this wasn’t true for the university grading study described above, people tend to be a little over-optimistic about how well they are doing [£], so giving clearer feedback tends to be more helpful than not in pushing people to make an effort. However, this doesn’t mean that getting bad news is ALWAYS useful. To understand why, we need to look at how feedback shapes other beliefs we hold about ourselves.

Beliefs about productivity of effort

So far, we’ve learnt how feedback shapes the beliefs we hold about our current performance. However, it also shapes our beliefs about our ability to learn, or climb up the rankings. Receiving a grade is a small part of a feedback loop, all the pieces of which must be in place for it to produce a useful behavioural response. The student needs to believe it is possible to play the game, i.e. that their effort can productively translate into learning that in turn translates into performance in the next test. (Where students do not feel they have the knowledge and capacity to translate effort into learning, decisions about how to convey attainment are irrelevant.)

All teachers know how important it is for students to maintain self-efficacy in order to persevere over extended periods of time at school, often without short-term rewards. Teachers will also know that, whilst we can be sure that grit and growth mindset are malleable characteristics, particularly in young children, it has proved difficult to construct reliable interventions to improve mindset. Unfortunately for those of us designing grading feedback, it can often simultaneously influence beliefs about performance and beliefs about ability to learn, such that they work against each other. Learning you’ve been unexpectedly successful in a test reinforces a belief in your ability to learn, yet dampens your desire to do so!

Feedback can, correctly or incorrectly, seriously impair our belief in the productivity of our effort – what Dweck (1986) calls ‘learned helplessness’. So, you’ve found out you are towards the bottom of the bell curve. You’ve come to believe that even a huge amount of effort can’t get you to a much better position. Then what? Well, you might be OK with lowering your own internal standard of how good you want to be and continuing to take part in the grading game… but it’s understandable that this doesn’t appeal to everyone.

Choosing to play your grading game

Young dropouts are often behaving quite as rationally as their more successful peers. If you are fourteen or fifteen and well behind academically, your chances of catching up start to look very very slim. And if you are not going to improve your relative position much, even if you change your whole lifestyle, then why not give up on the academic competition completely? It is likely to be better for your self-respect and, at least in the short term, not obviously worse for your other prospects: time spent in school when it isn’t going to improve your job opportunities, and you don’t enjoy the work, is just time wasted.

Alison Wolf
Does Education Matter?

Nobody likes to compete in a game they know they will do badly in, regardless of how hard they work. This is particularly true in education, where our ability can be tightly intertwined with our sense of self and how other people perceive us. Empirical research shows that in addition to liking to be at the top of the pile, we are also strongly “last-place” averse. If avoiding the bottom of the ladder is a powerful motivator in humans, how are your students going to try to avoid it? By trying to climb up? Or by trying to climb off?

Even where performance in the game is kept entirely private, there is good evidence that students develop strategies to avoid learning their ranking. In experiments, those who suspect they are lower performers are more likely to avoid learning their rank in class (e.g. here), thus refusing to take part in the grading game. Even worse is the phenomenon of ‘self-handicapping’ – deliberately withdrawing effort where there is potential for self-image-damaging feedback. For example, staying up late gaming with friends before an important exam allows us to attribute poor exam performance to tiredness rather than low ability.

Where rankings are made public, our responses to them will be strongly shaped by adherence to social norms and the preservation of our social reputation. I don’t want to get into reviewing the literature on award-giving (here is a nice summary), but it makes clear that responses are heavily framed by peer context. Awards can be motivating if handed out with care in the right circumstances (i.e. sparingly, in just circumstances, exploiting the bond of loyalty between giver and receiver, but with consideration of effects on non-recipients). However, if we choose to make academic effort observable to peers by publishing grades or awards, students may act to avoid social penalties or gain social favour by conforming to prevailing norms.

In the US, academics Bursztyn and Jensen have run school experiments highlighting how these risks play out with students. They introduced a performance-based leaderboard for an online learning platform which announced the top three performers in the classroom, in the school, and overall on the platform. This publication of the leaderboard led to a decrease in student performance on the platform, which was primarily driven by a decline in effort provided by students who were top performers prior to the introduction of the leaderboard. This suggests it created a fear of peer social sanction for high performers (by contrast, if anything, the performance of those at the bottom of the distribution slightly improved).

In another experiment, they showed that whether or not you sign up to a voluntary test preparation course is affected by whether your sign-up is made public. In classrooms of high attaining students taking advanced classes, sign-up rates were unaffected by whether the enrolment decision was public. It was in the classrooms with lower attaining students that enrolment rates were much lower under a public sign-up system. Social norms of the classroom matter. (A nice feature of the study was the ability to observe the same student sitting alongside high achieving peers in one classroom and alongside more mixed peers in another – think of a student who is in the ‘top set’ for one subject and a mixed ability class for another. These same students were negatively affected by the sign-up sheets being made public in the more mixed class, but were not in the high-attaining class.)

The mental model of responses to grades for your students

Handing out clear, cohort or nationally-referenced feedback to students or their parents on how they are performing has the potential to be a powerful motivator, but comes with risks. Knowing a student’s ability or attainment cannot help you predict how they will respond. What matters is how the grading information:

  • Changes their beliefs about their attainment
  • Changes their beliefs about their ability to learn and get better
  • Changes their desire to keep playing the competition of trying to be the best, or maintain their position, or avoid the bottom rung

Every teacher has to give some feedback on attainment, and there is no risk-free or value-neutral approach to doing it. I hope you can think about the mental model I’ve described in relation to students you know. Have you ever asked them, for example, how well they think they are doing before you hand out their grades? Who is suffering from ‘learned helplessness’, and is there anything you can do to counter that? Which five students you teach are most likely to respond negatively to learning their attainment in a more transparent way?

Schools do quite bizarre things to avoid telling students their position on the attainment bell curve. Bizarre, yes, but not necessarily unhelpful. In the final blog post, I will use this mental model to describe why schools end up creating these strange grading games!

You can find part III here.

Writing the rules of the grading game (part I): The grade changes the child

These three blogs (part I here, part II, part III) are based on a talk I gave at Headteachers’ Roundtable Summit in March 2019. My thoughts on this topic have been extensively shaped by conversations with Ben White, a psychology teacher in Kent. Neither of us yet know what we think!

Teachers are rarely trained in how to give formal, summative feedback to students and their parents – following a test, in a school report, or in a conversation at parents’ evening. We instinctively form views of how a student is doing in relation to others we teach, yet when we report attainment we frequently translate these relative perspectives into some sort of strange scale that is hard to interpret.

“Stuart is performing at a Level 2W”
“Mark achieved a Grade B+”
“James scored 68% in the class test”
“Laura has been awarded a Grade Triangle”

Grades, levels, percentages, arbitrary notions of expected, targets, progress measures, GCSE grades given in year 7, flightpaths… Why do we make it so complicated when the truth about attainment is really quite simple?

I want to explore the arguments of the ‘truth-advocates’ out there. There are those who think we should simply tell students what we can validly infer from the end-of-term tests we write and administer in our schools. Teacher-bloggers such as Mark Enser and Matthew Benyohai rightly point out that when we provide marks without norm-referencing information about how others have performed, our so-called grade actually provides students with little information (of course, students scramble around class to figure out this cohort-referenced information for themselves). A ‘Grade B’ isn’t inherently a phrase that contains meaning; its meaning for students arises through knowledge of the distribution of grades awarded. The ‘cleanest’ version of cohort-referencing – just handing out class or school rankings – seems to be quite commonplace in certain secondary school subject departments today, according to this Teacher Tapp sample. However, are there considerations beyond validity of inference that we should weigh before handing out rankings or grades?

The second set of ‘truth-advocates’ who like school-rankings prioritise attainment feedback’s role in an educational process where they are hoping to change student behaviours. You can read Deborah Hawkins’ blog about how termly rank order assessments are used at Glenmoor and Winton Academies to induce student effort. By creating a termly ranking game, they recognise the challenges schools face in trying to persuade students to spend time on their game – the game of getting good at maths or French – rather than the other games of life – being popular, playing sport, pursuing new relationships, and so on. Creating a game where students are induced to work harder to climb up the bell curve of achievement is potentially so powerful because we care a great deal about our place in social hierarchies. The economist Adam Smith (1759) once said, “rank among our equals, is, perhaps, the strongest of all our desires”. Biological research has shown that high rank is often associated with high concentrations of serotonin, a neurotransmitter in the brain that enhances feelings of well-being. What’s more, social comparisons are an indispensable part of bonding among adolescents.

It is easy to find academic studies that back up the grading policies of schools that report rankings or cohort-referenced grades. For example, a school in Spain experimented with giving students ‘grade-curving’ information, alongside the grades they had always received (e.g. supplementing the news of receiving a Grade C with information that the class average was a Grade B+). The provision of this cohort-referenced information led to an increase of 5% in students’ grades and the effect was significant for both low and high attainers. When the information was removed the following year, the effect disappeared. Similarly, a randomised trial on more than 1,000 sixth graders in Swedish primary schools found student performance was significantly higher with relative grading than with standard absolute grading. These positive effects of cohort-referencing are mirrored in numerous university field and lab experiments (e.g. here and here).

So, why don’t we all follow the ‘truth-advocates’ and give students clear, cohort or nationally-referenced feedback on how they are doing, allowing them to compete with their peers? We avoid this feedback, of course, because we are nervous about how our students will respond to it. Whether we are conscious of it or not, we all have a mental model of how we hope reporting attainment might change student behaviours. It is these implicit, mental models that explain why we might tell a half-truth to a student or parent, assuring them they are doing fine when the opposite is true. Mental models explain what meaning we hope to convey when we tell a student they have 68% in a test or why we allow students a few minutes to compare their marked papers with others in class. They explain why we send quite odd ‘tracking’ data home to parents, and are privately quite content that it doesn’t allow them to infer whether their child is doing better or worse than average.

In the blog posts that follow, I hope to persuade you of the importance of developing a clear mental model of how your students might respond to learning their attainment. I am collating a messy empirical literature from education, behavioural psychology and economics, one which lacks a consistent theoretical footing from which to build generalisable findings. Having read these studies, I think it is most useful for teachers to develop mental models that emphasise changes to students’ beliefs about themselves. Beliefs often fulfil important psychological and functional needs of the individual [£]. This literature emphasises three dimensions to describe how grading feedback affects student behaviour:[i]

  1. The effect on student beliefs about their attainment
  2. The effect on student beliefs about their ability to learn
  3. The effect on their willingness to play the game you want them to play

Talking about attainment is something that no teacher can avoid, and choosing to use fuzzy and ambiguous language with parents and students is not a value-neutral approach (for reasons that will become clear by blog three). Neither is telling children the whole truth about how they are performing. For so much of our time as teachers we talk about how the child can change the grade they receive. In these posts, we will be talking about how the grade can change the child!

You can find part II here.

[i] Note – this differs somewhat from the favoured feedback model of educationalists – Kluger and DeNisi’s (1996) Feedback Intervention Theory – which seems particularly pertinent to predicting mechanistic responses to task-based feedback but isn’t well-aligned with the disciplinary traditions of the empirical research I am reviewing here.

When would you like to be in a smaller class: age 5 or age 15?

Question: What links GCSE Design and Technology* with my 4 year old’s class size?

Answer: Money. And the choices we’ve made about how to spend it.

We’ve made the strangest resourcing choices in England, though they are so ingrained in our societal norms that it is hard for us to recognise them. Children start school at age 4 – younger than in most other countries in the world – and from day one we place them into classes that are enormous by international standards. It is common for us to have reception classes of 30 – OECD data reports the average in state-funded primary schools is 27.1. In other OECD countries, primary class sizes are typically around 21 students.

Now, of course we aren’t the only country to have large primary classes, but we are distinctive in that our class sizes shrink considerably as children get older. Class size is one of the primary drivers of school funding demands across phases of education. Nobody else makes the same relative funding choices as us.


In most countries, class sizes tend to grow as students get older. Perhaps these countries judge that older children are able to cope in larger classes. Perhaps they feel that smaller classes are needed for the pedagogical approaches more widely used with younger children.

(Now, it’s true that a Reception Year class nearly always has a full-time teaching assistant, but they aren’t trained teachers and it doesn’t compensate for limited physical space that becomes easily jammed as 4 year-olds shuffle between activities. By the age of 8 the full-time teaching assistants are long gone in most schools.)


So what are we buying with the cash we’ve saved through these large primary classes? If I told you secondary class sizes are, on average, under 21 students (OECD figures again), would you believe me? Many secondary teachers I speak to are surprised by this low figure because it doesn’t resonate with the classes they teach. The cash doesn’t reduce class sizes across the board. Instead, it is used to buy us two different things.

First, many schools run tiny lower attainment sets in core subjects. This makes sense for these students, who are struggling to access the GCSE curriculum, and I am pleased schools make this resourcing decision. However, we have to ask how resource deployment contributed to them arriving unable to access the secondary curriculum at age 11 in the first place. Is it optimal to deliver tiny maths class sizes at age 14 (as we do) or at age 8? I couldn’t tell you, but this is a testable question (and evidence on the benefits of small class sizes suggests deploying them in younger age groups is optimal).

Second, we are relatively unusual by international standards because our elective curriculum starts from age 14 (or even 13). For most schools this means delivering partly empty classes, since students rarely select optional subjects in neat multiples of 30. If you try to preserve free choice, but require class sizes to rise as has happened during austerity, then schools inevitably abandon subjects that I personally think are important, e.g. languages and music.

Most people (except Michael Fordham and me) seem to like subject choice. Students love giving up subjects (or teachers!) they dislike. Teachers like losing students who are uninterested in their subject. As someone who is neither a teacher nor a student any more, I am intrigued by the arguments teachers make for ever earlier curriculum restriction. The new GCSE Geography curriculum may indeed be so deep and complex that it requires a three year programme of study, and yet geography doesn’t appear to be so important that all our future citizens have a right to study it to age 16 (or indeed 14 in many schools now)!

This is not a ‘GCSE reform is due’ blogpost – I’d just like us to talk more about all the trade-offs we make when we allow subject choice at Key Stage 4. Giving greater depth to optional subjects through long GCSE programmes comes with two major costs. It removes study time in those subjects for students who do not continue them after Key Stage 3, and it requires us to fund the smaller class sizes that inevitably arise.

I know… I know… you are thinking you’d like to preserve the status quo in secondaries of small Key Stage 4 class sizes and reduce class sizes in infants. But resourcing education is all about trade-offs, and often these trade-offs need to happen within education, rather than between education and other parts of the economy. If we want smaller classes in infants then we have to think about whether we are prepared to give up anything else to achieve it.

This isn’t an adjustment that the education system could ever make on its own because it means taking cash away from some schools and giving it to others. Why would secondary teachers sign up for a reform that delivers larger class sizes that include students who would give up studying their subject given half the chance? I’m not saying we should enact this re-distributive policy, but I’d like us to have a fuller conversation about what sort of evidence would help us to make optimal resourcing decisions across phases.

*…or any other GCSE optional subject

It’s not (just) what teachers know, it’s who teachers know

I have been talking to many teachers and school leaders recently about what information needs to be recorded, whether in a markbook or in a centralised system, for a teacher to teach effectively. The answer is, partly, that it depends on what information the teacher is able to hold in their head, without the need for taking notes! A primary school teacher who spends 25 hours a week with the same 30 children has a rather easier job here than the secondary school music teacher who sees over 300 students pass through their classroom each week.

I recently dropped in to see a school where the need for written documentation was about as low as it is possible for it to be: a one-form entry primary school, very low family mobility, stable teaching staff, and a headteacher who knows all the students and their parents by name. We spoke for some time about what he thinks his teachers need to ‘know’ about students to do their job – the importance of nuanced views of what a child can already do, how difficult students are likely to find a new task they encounter, and how best to engage the child in learning. We spoke about how he thinks teachers accumulate this information about their students over the course of a year, and what is lost when students move to a new class in September.

He then mentioned in passing that they had decided to keep a number of classes with the same teacher last September. ‘So powerful to start the year already knowing your students!’, he said. In reply, I told him about a recent US research study that backs up his intuition. American educators call the policy of keeping students with the same teacher for a second academic year looping – what a great phrase! The study (£) in elementary schools showed small academic gains from keeping students with the same teacher for a second year. It is important to note that the effect size here isn’t massive, but in education policy we are almost always in the business of marginal gains.

Of course, the popularity of looping rather depends on having a pool of consistently effective teachers. In my family, we often still talk about one person’s disastrous three-year ‘looping’ experience as an infant pupil with an ineffective teacher. Looping practice for up to eight years in Steiner (Waldorf) Schools is said to lead to parents removing their children en masse from one class if they aren’t happy with their teacher.

Studies on the benefits of looping serve to remind us about the importance of teachers knowing their classes. Secondary schools make an effort to loop in years 10 and 11, but perhaps they should seek to extend this looping back into the younger years too. Other practices, such as ensuring form tutors also get to teach their classes or minimising incidences of ‘split’ classes, would seem to be important too but are increasingly hard to achieve where tight budgets leave no flexibility in timetabling arrangements. More controversially, it highlights one difficulty with job share arrangements in primary schools, where the part-time teachers necessarily take longer to get to know their class at the start of the year.

Other commentators have rightly drawn parallels with another study where elementary school teachers specialised in (usually) two of maths, science, English or social science and taught these subjects across multiple classes. The effect of this subject specialism was to lower pupil achievement. The author reported that “… teacher specialization, if anything, decreases student achievement, decreases student attendance, and increases student behavioural problems.”

Now, this wasn’t proper subject specialism that included training to become specialists: headteachers at each school simply helped to identify who should be allocated to specialise in which subject. That said, I interpret this second study as showing that we might want to think again about the trade-offs between having teachers who are subject experts, able to benefit from both disciplinary expertise and repeating the same lessons, and teachers who are experts in the students they teach. It is inevitably hard to be an expert in both.

In England, our schooling careers are U-shaped with respect to whether teachers know us well or not. Our youngest and oldest students benefit from a few teachers who get to know them very well. By contrast, between the ages of 11 and 14, students troop between a dozen different teachers each week. Are we sure we always get the trade-offs right? For example, how did we decide it was optimal to have one generalist teacher for ten year-olds, followed by ten subject specialists for eleven year-olds? Did the middle schools that chose to run a core part of each teaching day with the form teacher get something right? And whilst history teachers love not having to teach religious studies, and physics teachers love not having to teach biology, how far should a school fractionalise a pupil’s timetable before it becomes damaging to their academic and pastoral experience? These are empirical questions that we cannot yet answer.

It is great that so much policy energy has been focused on a more sophisticated understanding of the curriculum, of what makes subject knowledge domains distinctive and of what this implies for subject-specialist pedagogy. We should harness this sophisticated understanding of the number of hours it would take to train a teacher to deliver a particular curriculum, to a particular age group, with resources we may or may not have provided for them, to think deeply about whether we’ve always got the trade-offs right between becoming specialists in subjects or specialists in children.