Poor attainment data often comes too late!

It’s time to get positive about data. The right kind of data.

In my blogpost on the question of why we cannot easily measure progress, I explained why short, one-hour tests are rarely reliable enough to tell us anything interesting about whether or not a student has made sufficient progress over the course of a year. This is a source of worry for schools because measuring and reporting pupil progress is hard-baked into our school accountability system. My response about what to do was to tell teachers not to worry too much about progress since attainment is the thing we almost always want to know about anyway. If you still think that ‘progress’ is a meaningful numerical construct, I’d urge you to take a look at Tom Sherrington’s blog post on the matter.

I’ve since become even more convinced that measuring pupil progress is worse than irrelevant through conversations with Ben White, who pointed out to me that intervening on progress data is frequently unjust and disadvantages those who have historically struggled at school. Suppose you find two students who get 47% in your end-of-Year 7 history test. It isn’t a great score and suggests they haven’t learnt many parts of the year’s curriculum sufficiently well. Will you intervene to give either of them support? The response in many secondary schools nowadays would be to interpret the 47% in relation to their Key Stage 2 data. For the student who achieved good scaled scores at age 11 of around 107, the 47% suggests they are not on track to achieve predicted GCSE results and so will make a negative contribution to Progress 8. They are therefore marked down for intervention support. The other student left primary school with scaled scores around 94, so despite their poor historical knowledge at the end of Year 7, they are still on track to achieve their own predicted GCSE results. No intervention necessary here. It strikes Ben (and me) as deeply unjust that those who, for whatever reasons (chance, tutoring, high-quality primary school, etc.), get high Key Stage 2 scores are then more entitled to support than those who have identical attainment now, but who once held lower Key Stage 2 scores. It would seem to be entrenching pre-existing inequalities in attainment. For me, the only justification for this kind of behaviour is some sort of genetic determinism, where their SATs scores are treated as a proxy for IQ and we should make no special efforts to help students break free of the pre-determined flightpaths we’ve set up for them. Aside from questions of social justice, it makes no sense to expect pupil attainment to follow these predictable trajectories – they simply won’t, regardless of how much you wish they would.

But all of that is an aside and doesn’t address the question of what we should do if we find out that a student hasn’t learnt much / has made poor progress / has fallen behind peers / has low attainment [delete as appropriate according to your conception of what you are trying to measure]. The trouble is, by the time we find out that attainment data is poor in an end-of-year test the damage has already been done and it is very hard to un-do.

The response of most tracking systems to this problem is simply to collect attainment data more frequently, thus bringing forward the point where the damage can be spotted. The problem with this – apart from the destruction of teachers’ lives through test marking and data dropping – is that it is very hard to spot the emergence of falling behind after just six weeks of lessons. Remember that we are uncertain about ‘true’ attainment at each testing point, so it is very hard to use a one-hour test to distinguish genuine difficulties in learning that are causing a student to slip behind their peers from a one-off poorer score. If you intervene on everyone who shows poor progress in each six-week testing period then you’ll over-intervene with those who don’t really need outside class support, thus spreading your resource too thinly rather than concentrating on the smaller group who really do need help.
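To get a feel for the scale of the over-intervention problem, here is a minimal simulation sketch in Python. Every number in it is an assumption made up for illustration, not an estimate from real data: 1,000 students, one in ten genuinely slipping by 5 standardised points, a measurement error of about 4.7 points on each test, and a hypothetical rule that flags anyone whose score falls by 5 points or more between two sittings.

```python
import random


def simulate_flags(n_students=1000, decliner_rate=0.1, true_drop=5.0,
                   sem=4.7, flag_threshold=-5.0, seed=0):
    """Count students flagged for a falling score and how many truly declined.

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    flagged = 0
    flagged_true_decliners = 0
    for _ in range(n_students):
        is_decliner = rng.random() < decliner_rate
        true_change = -true_drop if is_decliner else 0.0
        # Each sitting adds independent measurement error to the true score,
        # so the observed change carries the noise of both tests.
        observed_change = true_change + rng.gauss(0, sem) - rng.gauss(0, sem)
        if observed_change <= flag_threshold:
            flagged += 1
            if is_decliner:
                flagged_true_decliners += 1
    return flagged, flagged_true_decliners


flagged, genuine = simulate_flags()
print(f"flagged: {flagged}, of whom genuinely slipping: {genuine}")
```

Under these made-up parameters you should find that most of the flagged students are not genuinely slipping at all. They simply had one noisier score than the other.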

There is an alternative. The most forward-thinking leadership teams in schools I have met start by planning what sorts of actions they need information for. Starting with this perspective yields a desire to seek out leading indicators that suggest a student might need some support, before the damage to attainment kicks in. Matthew Evans has a nice blog post where he describes how and why he is trying to prioritise ‘live’ data collection over ‘periodic’ data. Every school’s circumstances are slightly different but the cycle of learning isn’t so unique. Here is some data that really could lead to actionable changes to improve learning in schools:

  1. Which parents do I need to send letters to, or request meetings with, about poor school attendance? Data needed = live attendance records. See Stephen Tierney’s blog on how to write an effective letter home to parents.
  2. Which classes do I need to observe to review why school behaviour systems are not proving effective and support the teacher in improving classroom behaviour? Data needed = live behaviour records, logged as a simple code as incidents occur. (Combined with asking teachers how you can help, of course!)
  3. Which students now need an accelerated assessment of why they are not coping with the classroom environment, perhaps across several classrooms? Data needed = combining live behaviour records with periodic student or staff surveys of effort in class, attitudes to learning, levels of distraction. Beware! A music teacher should not be expected to do this for 400 students or for 20 individual classes. Concentrate on deep assessment of newly arrived year groups with simple ‘cause for concern’ calls for established students.
  4. How many students must I create provision for who have specific deficiencies in prior knowledge or skills that will make classes inaccessible? Data needed = periodic assessments of a set of narrowly defined skills – e.g. at the start of secondary school these might be fluency in number bonds, multiplication, arithmetic routines, clear handwriting, sufficiently fast reading speed, basic spelling and grammar. SATs and CAT tests are very poor proxies for these competencies and do not allow for efficiently targeted interventions.
  5. Which students might need alternative provision in place to complete homework? Data needed = live homework records if they are collected, or a periodic survey of homework completion. If centralised systems do not exist, do not ask every teacher to enter a data point for every student they teach when a simple ‘cause for concern’ call will suffice. Many schools are now organising an early parents’ evening to bring families where homework is an issue into school to find out why. For parents who themselves did not enjoy school, this early conversation might be enough for them to feel motivated to support their own children in completing homework. Otherwise, silent study facilities should be put in place.

Measuring attainment is like a rain collection device that tells us how much it has rained in the past. An action-orientated data collection approach requires us to create barometers – devices that tell us we may have a problem before the damage is done.

Attainment is useful for retrospective monitoring, but is less useful for choosing optimal actions by senior leadership. Of course, this doesn’t mean that teachers should neglect to check that students seem to be learning what is expected of them in day-to-day lessons. But for management it simply isn’t straightforward to generate frequent, reliable, summative assessment data across most subjects. And even if they could, once the attainment data reveals that a student or class has a problem, it has already been going on for some time. Attainment data is a lagged indicator that a student or staff member had a problem. Poor attainment data often comes too late. The trick is to sniff out the leading indicators that tell leaders where to step in before the damage is done.

What if we cannot measure pupil progress?

Testing and recording what students know and can do in a subject has always been part of our education system, especially in secondary schools where teachers simply cannot hold in their head accurate information about the hundreds of students they encounter each week. However, measuring progress – the change in attainment between two points in time – seems to be a rather more recent trend. The system – headteachers, inspectors, advisors – often wants to measure something quite precise: has a child learnt enough in a subject this year, relative to other children who had the same starting point?

The talks I have given recently at ResearchED Durrington and Northern Rocks set out why relatively short, standardised tests that are designed to be administered in a 45-minute or one-hour lesson are rarely going to be reliable enough to infer much about individual pupil progress. There is a technical paper and a blog post that outline some of the work we’ve been conducting on the EEF test database, which led us to start thinking about how these tests are used in schools. This blog post simply sets out a few conclusions to help schools make reasonable inferences from test data.

We can say a lot about attainment, even if progress is poorly measured

No test measures attainment precisely and short tests are inevitably less reliable than long tests. The typical lesson-long tests used by schools at the end of a term or year are reliable enough to infer approximately where a student sits on a bell curve that scores all test-takers from the least good to the best in the subject. This works OK, provided all the students are studying the same curriculum in approximately the same order (a big issue in some subjects)!

Let’s take a student who scored 109 in a maths test at the start of the year. We cannot use that single score to assert that they must be better at maths than someone scoring 108 or 107. However, it is a good bet that they are better at maths than someone scoring 99. This is really useful information about maths attainment.

When we use standardised tests to measure relative progress, we often look to see whether a student has moved up (good) or down (bad) the bell curve. This student scored 114 at the end of the year. On the face of it this looks like they’ve made good progress, and learnt more than similar students over the course of the year. However, 109 is a noisy measure of what they knew at the start of the year and 114 is a noisy measure of what they knew at the end of the year. Neither test is reliable enough to say if this individual pupil’s progress is actually better or worse than should be expected, given their starting point.
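The arithmetic behind this can be sketched in a few lines of Python using classical test theory. The figures here are my assumptions for illustration only: a standardised scale with a standard deviation of 15, a test reliability of 0.9 (generous for a lesson-long test), and independent errors on the two sittings. The function names are hypothetical too.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability) under classical test theory."""
    return sd * math.sqrt(1.0 - reliability)


def change_interval(start: float, end: float, sd: float = 15.0,
                    reliability: float = 0.9, z: float = 1.96):
    """Return (observed change, lower, upper) for a roughly 95% interval.

    sd and reliability are assumed values, not properties of any real test.
    """
    sem = standard_error_of_measurement(sd, reliability)
    # Errors on the two sittings are assumed independent, so variances add
    # and the standard error of the difference is sqrt(2) times the SEM.
    se_diff = math.sqrt(2.0) * sem
    change = end - start
    return change, change - z * se_diff, change + z * se_diff


change, lower, upper = change_interval(109, 114)
print(f"observed change: {change:+.0f}, interval: {lower:+.1f} to {upper:+.1f}")
```

With these assumed values the interval runs from roughly 8 points below zero to 18 points above it, so the 5-point rise from 109 to 114 is entirely consistent with no real change in relative position.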


Dylan Wiliam (2010) explains that the challenge of measuring annual test score growth occurs because “the progress of individual students is slow compared to the variability of achievement within the age cohort”. This means that a school will typically find that only a minority of their pupils record a test score growth statistically significantly different from zero.

Aggregation is the friend of reliability

You can make a test more reliable by making it longer, sat over multiple papers, but this isn’t normally compatible with the day-to-day business of teaching and learning. However, teachers who regularly ask students to complete class quizzes and homework have the opportunity to compile a battery of data on how well a student is attaining. Although teachers will understandably worry that this ‘data’ isn’t as valid as a well-designed test, intelligently aggregating test and classwork data is likely to lead to a more reliable inference about a pupil’s attainment than relying on the short end-of-term test alone. (Of course, this ‘rough aggregation’ is exactly what teachers used to do when discussing attainment with parents, before pupil tracking was transferred from the teacher markbook to the centralised tracking software!)
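One standard way to see why aggregation helps is the Spearman-Brown prediction formula, sketched below in Python. It rests on a strong assumption that rarely holds exactly in a real markbook: that each quiz is a parallel measure of the same thing with the same reliability. The starting reliability of 0.6 for a single short quiz is a hypothetical value chosen for illustration.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when k parallel measures are combined.

    Assumes every measure has the same reliability and taps the same
    construct, which is an idealisation of real classwork data.
    """
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Hypothetical single-quiz reliability of 0.6, combined over 1 to 8 quizzes.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.6, k), 2))
```

On these assumptions, a single quiz with reliability 0.6 rises to 0.75 when two are combined and to about 0.92 when eight are combined. Real quizzes and homework will fall short of this ideal, but the direction of travel is the point.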

Teacher accountability is the enemy of inference

Teachers always mediate tests in schools. They might help write the test, see it in advance, warn pupils or parents about the impending test, give guidance on revision, advise pupils about the consequences of doing badly, and so on. If the tests are high-stakes for teachers (i.e. used in performance management) and yet low-stakes for the pupils, it can become difficult for the MAT or school to ensure tests are sat in standardised conditions.

For example, if some teachers see the test in advance they might distort advice regarding revision topics in a manner that improves test performance but not the wider pupil knowledge domain. Moreover, some teachers may have an incentive to try to raise the stakes for pupils in an attempt to increase test persistence. The impact of the testing environment and perception of test stakes has been widely studied in the psychometric literature. In short, we need to be sure that standardised tests (of a standardised curriculum) are sat in standardised conditions where students and teachers have standardised perceptions of the importance of the test. For headteachers to make valid inferences across classrooms, or across schools, they need to be clear that they understand how the stakes are being framed for all students taking the test, even those who are not in their own school!

I think this presents a genuine problem for teacher accountability. One of the main reasons we calculate progress figures is to try to hold teachers to account for what they are doing, but the very act of raising the stakes for teachers (and not necessarily for pupils) can create variable test environments that threaten our ability to measure progress reliably!

The longer a test is in place, the more it risks distorting curriculum

A test can only ever sample the wider subject knowledge domain you are interested in assessing. This can create a problem where, as teachers become more familiar with the test, they will ‘bend’ their teaching towards the test items. Once this happens, the test itself becomes a poor proxy for the true subject knowledge domain. There are situations where this can seriously damage pupil learning. For example, many primary teachers report that one very popular standardised test is rather weak on arithmetic compared to SATs; given how important automaticity in arithmetic is, let’s hope no year 3, 4 or 5 teachers are being judged on their class performance in this test!

Our best hopes for avoiding serious curriculum distortion (or assessment washback) are two-fold. First, lower the stakes for teachers (see above). Second, make the test less well-known or less predictable for teachers. In the extreme, we hear of schools that employ external consultants to write end-of-year tests so that the class teachers cannot see them in advance. More realistically, frequently changing the content of the test can help minimise curriculum distortion, but is clearly time-consuming to organise. Furthermore, if the test changes each year then subject departments cannot straightforwardly monitor whether year group cohorts are doing better or worse than previous years.

None of this is a good reason not to make extensive use of tests in class!

Sitting tests and quizzes is an incredibly productive way to learn. Retrieval during a test aids later retention. Testing can produce better organisation of knowledge or schemas. As a consequence of this, testing can even facilitate retrieval of material that was not tested and can improve transfer of knowledge to new contexts.

Tests can be great for motivation. They encourage students to study! They can improve metacognitive monitoring to help students make sense of what they know (and don’t yet know).

Tests can aid teacher planning and curriculum design. They can identify gaps in knowledge and provide useful feedback to instructors. Planning a series of assessments forces us to clarify what we intend students to learn and to remember in one month, one year, three years, five years, and so on.

Are we better off pretending we can measure progress?

I’m no longer sure that anybody is creating reliable termly or annual pupil progress data by subject. (If you think you are then please tell me how!) Perhaps we don’t really need to have accurate measures of pupil progress to carry on teaching in our classrooms. Education has survived for a long time without them. Perhaps SLT and Ofsted don’t really mind if we aren’t measuring pupil progress, so long as we all pretend we are. Pretending we are measuring pupil progress creates pressure on teachers through the accountability system. Perhaps that’s all we want, even if the metrics are garbage.

Moreover, I don’t know whether the English education system can live in a world where we know that we cannot straightforwardly measure pupil progress. But I am persuaded by this wonderful blogpost (written some time ago) by headteacher Matthew Evans that we must come to terms with this reality. Like many other commentators on school accountability, he draws an analogy with The Matrix film in which Neo must decide whether to swallow the red or blue pill:

Accepting that we probably can’t tell if learning is taking place is tantamount to the factory manager admitting that he can’t judge the quality of the firm’s product, or the football manager telling his players that he doesn’t know how well they played. The blue pill takes us to a world in which leaders lead with confidence, clarity and certainty. That’s a comfortable world for everyone, not just the leader.

He goes on to argue, however, that we must swallow the red pill, because:

However grim and difficult reality is, at least it is authentic. To willingly deceive ourselves, or be manipulated by a deceitful other (like Descartes’ demon), is somehow to surrender our humanity.

And so, what if we all – teachers, researchers, heads, inspectors – accept that we are not currently measuring pupil progress?

What then?