Yesterday, Adam Robbins posted an excellent thread on Twitter about assessment that included this comment on Question Level Analysis (QLA). It’s easy to say that QLA generally isn’t a good idea (and I used to say this), but I keep encountering teachers who insist that they find analysing individual questions on a test helpful! My post below is the outcome of a long email exchange some years ago with a primary teacher who loved QLA, in which we finally came to agree about the useful scope of analysing question responses. Let’s see if you agree with the two of us!

Imagine your students have just sat a standardised maths test and you want to know whether they understand fractions. **If** the test had space for 30 questions about fractions then there is a good chance it could be designed so that we can make valid inferences as to the security of understanding about fractions. However, in this test all we know is whether they could correctly answer **three** fractions questions (out of 30 maths questions on the test):

- Which is larger: 3/4 or 7/16?
- A long word problem which requires them to compile and then answer 3/8+1/4
- What is 2/3 + 2/5?

Ideally, we don’t really want to know whether they can answer these questions. We want to know whether they understand the fractions topic within the curriculum. We can’t straightforwardly infer one thing from the other. However, this doesn’t mean the test score information tells us *nothing* about the class’s understanding of fractions. I’m going to briefly summarise why inference about any one individual child is quite limited, but then describe what could be said about the whole class and about the selection of students for intervention.

## Limited (but not zero) inference about an individual student’s understanding of the topic domain

A single student’s responses to questions undoubtedly tell us *something* about how well they understand fractions, particularly if they scored 0/3! However, they cannot tell us about their performance within the entire topic domain for the following reasons:

- Performance on just three questions is not going to be a very reliable indicator of how well they understood those exact questions. They might well have achieved a different score on a different day, thanks to lucky ‘guesses’, silly errors or use of non-generalisable strategies.
- Their performance tells us something useful, but not everything, about how well they understand the precise topic that was tested at that level of difficulty, e.g. adding two fractions with different denominators. If they got that question right, there *are* quite good odds they know how to add fractions of a similar level of difficulty. However, if they got the question wrong, we face the problem that we don’t know whether the error arose from poor understanding of fractions, or of something else (such as arithmetic misunderstanding, misreading or error). Even in maths, test questions tend to rely on knowledge from multiple topic domains.
- It becomes even more difficult to know whether they can answer more difficult or less difficult questions within the same precise topic (e.g. 1/2 + 1/4; 3/8 + 1/4 + 2/5).
- Their responses to these three questions don’t allow us to infer their understanding of related content within the topic of fractions, such as subtraction or multiplication of fractions (though we might infer a little more about the former, which is closely related to addition, than the latter).
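
To put rough numbers on the first of these points, here is a minimal sketch. It assumes, purely for illustration, that a student answers each of the three questions correctly with the same independent probability; real questions differ in difficulty, so this is a simplification. The 80% and 50% figures are hypothetical.

```python
from math import comb

def score_distribution(p_correct: float, n_questions: int = 3) -> list[float]:
    """Probability of each possible score (0..n_questions), assuming every
    question is answered correctly with the same independent probability."""
    return [
        comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        for k in range(n_questions + 1)
    ]

# Hypothetical students: one fairly secure (80%), one fairly insecure (50%).
for p in (0.8, 0.5):
    probs = score_distribution(p)
    print(f"p={p}: " + ", ".join(f"{k}/3: {x:.3f}" for k, x in enumerate(probs)))
```

Under these assumptions the fairly secure student drops at least one mark almost half the time, while the fairly insecure student scores 2/3 or 3/3 half the time, so a single three-question score separates them poorly.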

(If this kind of argument is new to you, I’d urge you to read more about sampling and inference within knowledge domains in this brilliant blog by Adam Boxer.)

## A starting point for inference about a whole class’s understanding of the topic domain

Some of the issues of inference are resolved when we look across the responses of thirty children in the class. The ‘noise’ of lucky guesses and silly errors becomes less important when we are able to see that, for example, only 8 out of 30 students successfully answered the question 2/3 + 2/5 but 22 out of 30 students could do the equivalent fractions question. Yes, we still don’t know how well the class understand the topic of fractions overall. But this information can give a teacher a useful starting point for working out where the re-teaching needs to start. (After seeing this data they might, for example, want to spend 15 minutes using mini-whiteboards to assess class understanding of easier fraction addition questions to learn more about where difficulties start to be encountered.)
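
As a rough check on how much of that 8 versus 22 gap could be noise, the sketch below computes an approximate 95% Wilson score interval for each proportion. It treats each student’s answer as an independent draw, which is a simplification (the same thirty students answered both questions), and `wilson_interval` is just an illustrative helper name.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Class results from the example: 8/30 on addition, 22/30 on equivalent fractions.
for label, correct in [("2/3 + 2/5", 8), ("equivalent fractions", 22)]:
    low, high = wilson_interval(correct, 30)
    print(f"{label}: {correct}/30 correct, plausible range {low:.0%} to {high:.0%}")
```

Both ranges are wide, but they do not overlap, which is why the class-level contrast is a more trustworthy signal than any one student’s score.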

However, there can be a serious issue with using test questions to guide topic selection for re-teaching. Not only do test questions often cover multiple topics, but they also vary in difficulty (both in question type and content). We can imagine, in this example, a different version of the test with an easier fraction addition question and a harder equivalent fractions question, which might lead us to decide to spend our scarce time re-teaching equivalent fractions instead.

For this reason, test companies tend to give schools additional analysis that can help you see whether your own students did *relatively* well or poorly on individual questions (and topics) compared to a standardised sample. This does some of the work of adjusting for question difficulty when considering the proportion of the class who answered correctly. (Below is an example from NFER.)

Whilst I would be curious to look at this type of QLA report, my main worry would be that my own students’ relative performance on an individual test question would be *very* sensitive to the way that topic was taught within the materials I use *and* the timing of the test relative to the original teaching episode. Students tend to do better at questions more recently taught. So, for example, if my students do brilliantly well at fractions questions compared to other schools, is this simply because we’ve just finished the topic and it is fresh in their minds, whereas other schools did it over a term ago? If so, it may tell us little about relative success in recall in three months’ time.

## Selecting students for intervention based on test performance data

In my email exchange with the primary teacher about QLA, they asked me whether they should use standardised tests to help them decide which students needed interventions. Imagine, for example, a tutor appeared at the door asking for five students to work on a fractions intervention. They asked me: Isn’t it fairest to just select the five students who got the most answers wrong on those three fractions questions?

Well, ‘fair’ is obviously a tricky question, so let’s re-word it to ask whether the five lowest scores represent the five weakest students at fractions in the class. Unfortunately, they may not be the weakest. In order to select students for an intervention group, we are back to having to make valid and reliable inferences about how well students understand fractions based on just three questions. Whilst we might agree those three questions can help us identify some students who *don’t* need to join the intervention group, they can’t reliably discriminate within the larger group in the class who appear to have some level of insecure understanding of fractions.
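
A small simulation can make this concrete. Everything in it is an assumption for illustration: thirty hypothetical students whose ‘true’ chance of answering a fractions question ranges evenly from 20% to 90%, questions of equal difficulty, independent answers, and ties for the lowest scores broken at random. It counts how many of the truly weakest five end up among the five lowest scorers.

```python
import random

def simulate_overlap(n_trials: int = 2000, n_questions: int = 3,
                     group_size: int = 5, seed: int = 0) -> float:
    """Average number of the truly weakest students who are selected
    when we pick the lowest scorers on a short test."""
    rng = random.Random(seed)
    n_students = 30
    # Hypothetical 'true' per-question success probabilities, 0.2 up to 0.9.
    true_p = [0.2 + 0.7 * i / (n_students - 1) for i in range(n_students)]
    truly_weakest = set(range(group_size))  # the lowest true_p are indices 0..4
    total_overlap = 0
    for _ in range(n_trials):
        scores = [sum(rng.random() < p for _ in range(n_questions)) for p in true_p]
        # Rank by score, breaking ties at random.
        order = sorted(range(n_students), key=lambda i: (scores[i], rng.random()))
        selected = set(order[:group_size])
        total_overlap += len(selected & truly_weakest)
    return total_overlap / n_trials

print("3 questions: ", simulate_overlap(n_questions=3))
print("30 questions:", simulate_overlap(n_questions=30))
```

The exact figures depend entirely on the assumed spread of understanding, so they should not be read as estimates for any real class. The comparison between the three-question and thirty-question runs is the point: the shorter the test, the more the selection is driven by chance.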

This doesn’t mean we are completely stuck. Reliable judgements about individual students don’t necessarily need long tests, since we can look back over all the information we have about their understanding of fractions – past classwork, quizzes, tests – to form a holistic judgement about whether they should be in the group of five who receive the intervention session. The information about their performance on these three questions *does* have value to us – especially since it relates to their current recall of fractions rather than performance during the original teaching episode – but needn’t (and shouldn’t) be used exclusively.

## Conclusion

It’s easy to be a purist about assessment and say that teachers are making erroneous inferences from test data. I’m often very guilty of doing this because I don’t have to make decisions in the classroom. I used to say that QLA is a waste of time and I still think that individual student analysis of questions isn’t that much use. However, it would be wrong to dismiss student responses to individual questions as telling us ‘nothing’. Of course, they provide teachers with new information and allow us to update our priors about the understanding of a topic.

Teachers will naturally want to think about how well a class collectively managed to answer particular questions on a particular test because it is a good starting point for a process of enquiry and reflection into what their classes understand and can do. To that end, the QLA reports provided by standardised test providers have some value as a *starting* point for inference, but only if used with great care.