The importance of being accurate

Ben Smith investigates how we can measure marking reliability

Marking exam papers

On the face of it, measurement is not the most compelling topic. But getting it right can be extremely important – you wouldn’t want your pharmacist mixing the wrong quantities of chemicals into your antibiotics, for instance! In CERP, we concern ourselves with a very different type of measurement: measuring attainment in a particular subject area. Unlike a pharmacist, we can’t pull out scales or measuring cylinders to measure a student’s attainment in, say, English. Instead, our measurement tool is an exam. A student’s score on the exam indicates their level of attainment, and that score is achieved through marking their responses to each question. While exam results are not such a life-or-death matter as pharmaceuticals, students’ futures can depend on them. As such, it’s vital that the results are as accurate as possible.

One way that we try to ensure results are accurate is through monitoring marking reliability. Something is considered reliable when it can be repeated or tested several times and produces a similar outcome each time. So, we make sure that a proportion of students’ responses to any question are marked multiple times, by several different examiners. This gives us some indication of how consistent – and thus reliable – the marking has been for a particular question, and overall for an entire exam.

So far, it sounds quite simple. The complication is that there are dozens of different ways you can turn this information into a numerical value that represents marking reliability – just as you can measure distance using a measuring tape, metre stick, laser tool, trigonometry, or even a length of string. Of course, not all of these tools are equally suited to measuring different distances: it is probably more practical to use trigonometry to work out the height of a house than to use a metre stick. The same principle applies to the various ways we can estimate marking reliability.
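To make the idea concrete, here is a minimal sketch of two of the many possible "measuring tools": exact agreement (the proportion of responses given identical marks by two examiners) and the correlation between their marks. The examiner names and marks below are invented purely for illustration; they are not drawn from any real marking data.

```python
# Two simple ways to quantify marking reliability from double-marked
# responses. The marks are hypothetical, for illustration only.

def exact_agreement(marks_a, marks_b):
    """Proportion of responses given identical marks by both examiners."""
    matches = sum(1 for a, b in zip(marks_a, marks_b) if a == b)
    return matches / len(marks_a)

def pearson_correlation(marks_a, marks_b):
    """How closely the two examiners' marks rise and fall together."""
    n = len(marks_a)
    mean_a = sum(marks_a) / n
    mean_b = sum(marks_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(marks_a, marks_b))
    var_a = sum((a - mean_a) ** 2 for a in marks_a)
    var_b = sum((b - mean_b) ** 2 for b in marks_b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical marks from two examiners for ten responses to one question
examiner_1 = [4, 5, 2, 6, 3, 5, 4, 1, 6, 3]
examiner_2 = [4, 4, 2, 6, 3, 6, 4, 2, 6, 3]

print(f"Exact agreement: {exact_agreement(examiner_1, examiner_2):.0%}")
print(f"Correlation:     {pearson_correlation(examiner_1, examiner_2):.2f}")
```

Note that the two statistics can tell different stories: an examiner who is consistently one mark more severe than a colleague would show a correlation close to 1 but a low exact-agreement rate. That is exactly why the choice of measuring tool matters.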

Reliability also needs to be balanced against validity – whether or not the mark allocated actually measures what it is supposed to measure. We could easily achieve consistent marking if all the questions were multiple choice or required one-word answers. But whether either of these is the most valid method of assessment – particularly for subjects like English or History – is an entirely different matter. And while simpler questions can be marked by computers with perfect accuracy, more complex marking is more commonly performed by examiners. Great efforts are made to ensure that examiners mark consistently and accurately, but they are – just like the rest of us – human. The occasional mistake and some degree of subjectivity are inevitable. So, whenever we consider marking reliability, we also need to think very carefully about what skills we really want to assess, and exactly how much unreliability we think is reasonable.

Here in CERP, we’ve been exploring these issues in depth. We have compared and contrasted some of the different methods that can be used to quantify marking reliability, and have attempted to establish which method is the most appropriate in different circumstances. 

Ben Smith
