Testing and recording what students know and can do in a subject has always been part of our education system, especially in secondary schools where teachers simply cannot hold in their head accurate information about the hundreds of students they encounter each week. However, measuring progress – the change in attainment between two points in time – seems to be a rather more recent trend. The system – headteachers, inspectors, advisors – often wants to measure something quite precise: has a child learn enough in a subject this year, relative to other children who had the same starting point?
The talks I have given recently at ResearchED Durrington and Northern Rocks set out why relatively short, standardised tests that are designed to be administered in a 45-minute/one hour lesson are rarely going to be reliable enough to infer much about individual pupil progress. There is a technical paper and a blog post that outlines some of the work that we’ve been conducting on the EEF test database that led us to start thinking about how these tests are used in schools. This blog post simply sets out a few conclusions to help schools make reasonable inferences from test data.
We can say a lot about attainment, even if progress is poorly measured
No test measures attainment precisely and short tests are inevitably less reliable than long tests. The typical lesson-long tests used by schools at the end of a term or year are reliable enough to infer approximately where a student sits on a bell curve that scores all test-takers from the least good to the best in the subject. This works OK, provided all the students are studying the same curriculum in approximately the same order (a big issue in some subjects)!
Let’s take a student who scored 109 in a maths test at the start of the year. We cannot use that single score to assert that they must be better at maths than someone scoring 108 or 107. However, it is a good bet that they are better at maths than someone scoring 99. This is really useful information about maths attainment.
When we use standardised tests to measure relative progress, we often look to see whether a student has moved up (good) or down (down) the bell curve. This student scored 114 at the end of tell the year. On the face of it this looks like they’ve made good progress, and learnt more than similar students over the course of the year. However, 109 is a noisy measure of what they knew at the start of the year and 114 is a noisy measure of what they knew at the end of the year. Neither test is reliable enough to say if this individual pupil’s progress is actually better or worse than should be expected, given their starting point.
Dylan Wiliam (2010) explains that the challenge of measuring annual test score growth occurs because “the progress of individual students is slow compared to the variability of achievement within the age cohort”. This means that a school will typically find that only a minority of their pupils record a test score growth statistically significantly different from zero.
Aggregation is the friend of reliability
You can make a test more reliable by making it longer, sat over multiple papers, but this isn’t normally compatible with the day-to-day business of teaching and learning. However, teachers who regularly ask students to complete class quizzes and homework have the opportunity to compile a battery of data on how well a student is attaining. Although teachers will understandably worry that this ‘data’ isn’t as valid as a well-designed test, intelligently aggregating test and classwork data is likely to lead to a more reliable inference about a pupil’s attainment than relying on the short end-of-term test alone. (Of course, this ‘rough aggregation’ is exactly what teachers used to do when discussing attainment with parents, before pupil tracking was transferred from the teacher markbook to the centralised tracking software!)
Teacher accountability is the enemy of inference
Teachers always mediate tests in schools. They might help write the test, see it in advance, warn pupils or parents about the impending test, give guidance on revision, advise pupils about the consequences of doing badly, and so on. If the tests are high-stakes for teachers (i.e. used in performance management) and yet low-stakes for the pupils, it can become difficult for the MAT or school to ensure tests are sat in standardised conditions.
For example, if some teachers see the test in advance they might distort advice regarding revision topics in a manner that improves test performance but not the wider pupil knowledge domain. Moreover, some teachers may have an incentive to try to raise the stakes for pupils in an attempt to increase test persistence. The impact of the testing environment and perception of test stakes has been widely studied in the psychometric literature. In short, we need to be sure that standardised tests (of a standardised curriculum) are sat in standardised conditions where students and teachers have standardised perceptions of the importance of the test. For headteachers to make valid inferences across classrooms, or across schools, they need to be clear that they understand how the stakes are being framed for all students taking the test, even those who are not in their own school!
I think this presents a genuine problem for teacher accountability. One of the main reasons we calculate progress figures is to try to hold teachers to account for what they are doing, but very act of raising the stakes for teachers (and not necessarily for pupils) can create variable test environments that threaten our ability to measure progress reliably!
The longer a test is in place, the more it risks distorting curriculum
A test can only ever sample the wider subject knowledge domain you are interested in assessing. This can create a problem where, as teachers become more familiar with the test, they will ‘bend’ their teaching to towards the test items. Once this happens, the test itself becomes a poor proxy for the true subject knowledge domain. There are situations where this can seriously damage pupil learning. For example, many primary teachers report that one very popular standardised test is rather weak on arithmetic compared to SATs; given how important automaticity in arithmetic is, let’s hope no year 3, 4 or 5 teachers are being judged on their class performance in this test!
Our best hopes for avoiding serious curriculum distortion (or assessment washback) are two-fold. First, lower the stakes for teachers (see above). Second, make the test less well-known or less predictable for teachers. In the extreme, we hear of schools that employ external consultants to write end-of-year tests so that the class teachers cannot see them in advance. More realistically, frequently changing the content of the test can help minimise curriculum distortion, but is clearly time-consuming to organise. Furthermore, if the test changes each year then subject departments cannot straightforwardly monitor whether year group cohorts are doing better or worse than previous years.
None of this is a good reason not to make extensive use of tests in class!
Sitting tests and quizzes is an incredibly productive way to learn. Retrieval during a test aids later retention. Testing can produce better organisation of knowledge or schemas. As a consequence of this, testing can even facilitate retrieval of material that was not tested and can improve transfer of knowledge to new contexts.
Tests can be great for motivation. They encourage students to study! They can improve metacognitive monitoring to help students makes sense of what they know (and don’t yet know).
Tests can aid teacher planning and curriculum design. They can identify gaps in knowledge and provide useful feedback to instructors. Planning a series of assessments forces us to clarify what we intend students to learn and to remember in one month, one year, three years, five years, and so on.
Are we better off pretending we can measure progress?
I’m no longer sure that anybody is creating reliable termly or annual pupil progress data by subject. (If you think you are then please tell me how!) Perhaps we don’t really need to have accurate measures of pupil progress to carry on teaching in our classrooms. Education has survived for a long time without them. Perhaps SLT and Ofsted don’t really mind if we aren’t measuring pupil progress, so long as we all pretend we are. Pretending we are measuring pupil progress creates pressure on teachers through the accountability system. Perhaps that’s all we want, even if the metrics are garbage.
Moreover, I don’t know whether the English education system can live in a world where we know that we cannot straightforwardly measure pupil progress. But I am persuaded by this wonderful blogpost (written some time ago) by headteacher Matthew Evans that we must comes to terms with this reality. Like many other commentators on school accountability, he draws an analogy with The Matrix film in which Neo must decide whether to swallow the red or blue pill:
Accepting that we probably can’t tell if learning is taking place is tantamount to the factory manager admitting that he can’t judge the quality of the firm’s product, or the football manager telling his players that he doesn’t know how well they played. The blue pill takes us to a world in which leaders lead with confidence, clarity and certainty. That’s a comfortable world for everyone, not just the leader.
He goes on to argue, however, that we must swallow the red pill, because:
However grim and difficult reality is, at least it is authentic. To willingly deceive ourselves, or be manipulated by a deceitful other (like Descartes’ demon), is somehow to surrender our humanity.
And so, what if we all – teachers, researchers, heads, inspectors – accept that we are not currently measuring pupil progress?