Making the most of examiner judgement

Neil Stringer considers the role of examiners’ judgements in maintaining examination standards over time.

All exam boards use a combination of statistics and expert opinion to decide how tough an exam is and where the grade boundaries for a particular paper should be set. But can we trust examiners to make reliable judgements about exam papers? Couldn’t we just rely on cold, hard numbers to tell us where grade boundaries should lie?

Some background…

Broadly speaking, there are two ways of presenting exam results. Norm referencing compares your results to those of other test takers. You’d be told you’re in the 50th centile, for example. The other way, criterion referencing, involves comparing your performance to a required standard. For example, to achieve grade C, you must demonstrate X, Y, and Z.
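The contrast can be made concrete with a small sketch. Everything here is invented for illustration: the cohort marks, the criteria labels, and the helper names are hypothetical, not anything drawn from actual exam-board practice.

```python
# Hypothetical sketch: norm referencing reports standing relative to the
# cohort; strict criterion referencing checks every required criterion.

def percentile_rank(mark, cohort_marks):
    """Percentage of the cohort scoring at or below this mark (norm referencing)."""
    at_or_below = sum(1 for m in cohort_marks if m <= mark)
    return 100 * at_or_below / len(cohort_marks)

def meets_criteria(demonstrated, required):
    """Strict criterion referencing: every required criterion must be met."""
    return required.issubset(demonstrated)

cohort = [34, 41, 47, 52, 52, 58, 63, 70, 75, 81]        # made-up marks
print(percentile_rank(52, cohort))                        # 50.0 -> "50th centile"
print(meets_criteria({"X", "Y", "Z"}, {"X", "Y", "Z"}))   # True -> grade C achieved
```

Note how unforgiving the second function is: missing any single criterion returns False, which is exactly the rigidity the article goes on to question.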

The grades we use in GCSEs and A-level results are intended to reflect the knowledge, skills, and understanding that candidates have demonstrated in their exams. In this sense, the grade standards are based on ‘criteria’. However, if these exams were strictly criterion-referenced, a candidate would need to meet every single criterion to achieve a particular grade.

Bending the rules

There are two problems with this approach. The first is pragmatic: research shows that what might appear to examiners to be trivial differences between questions of apparently equal weight can in practice lead to quite substantial differences in the difficulty experienced by the candidates answering them.1,2 This means that meeting the criteria, and therefore achieving a certain grade, will be easier in some versions of the exam than others, which is clearly unfair.

The second problem is a matter of principle: GCSEs and A-levels are designed to balance out performance within and across papers. So, if you are performing at grade A standard across the English syllabus, but can’t use apostrophes appropriately, you won’t receive a grade D simply because one of the criteria for a grade C or above is to use apostrophes correctly. Such compensation allows for a balanced picture of a candidate’s ability.

So, rather than grading candidates’ papers using a list of immutable criteria, papers are marked, awarding credit according to a mark scheme. This establishes the rank order of the candidates who have taken the exam, based on the total marks they have been awarded for their paper. Senior examiners then look at a selection of answer papers across a range of marks to establish where grade boundaries should lie – the marks at which candidates begin to demonstrate the knowledge, skills, and understanding associated with the particular grade. This is done by referring to grade descriptors that outline these elements, and to a selection of the previous year’s answer papers with marks that were on the grade boundary that year.
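Once the senior examiners have judged where the boundaries lie, converting a candidate's total mark into a grade is mechanical. A minimal sketch, with entirely invented boundary marks (real boundaries are set afresh for each paper):

```python
# Invented boundary marks for illustration only: the mark at which each
# grade begins on this hypothetical paper.
GRADE_BOUNDARIES = {"A": 64, "B": 55, "C": 46, "D": 38, "E": 30}

def grade_for(mark):
    # Dicts preserve insertion order, so we check the highest grade first.
    for grade, boundary in GRADE_BOUNDARIES.items():
        if mark >= boundary:
            return grade
    return "U"  # unclassified

print(grade_for(58))  # "B" under these invented boundaries
```

The judgement the article describes lives entirely in choosing the boundary values, not in this lookup: move a boundary by a mark or two and thousands of candidates change grade.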

The task for senior examiners is to judge the relative difficulty of different question papers and set the grade boundaries on the new paper so that candidates would achieve the same grades whichever version of the paper they sat.

Nice idea, but…

As long ago as 1988, Frances Good and Mike Cresswell demonstrated that examiners aren’t very good at this.3,4 In an elegant study, they gave each candidate two versions of an exam paper, one more demanding than the other, and examiners then set grade boundaries for both papers. The result was that the same candidates achieved higher grades on the easier paper than they did on the harder paper. The examiners did not compensate adequately for the difference in difficulty between the two papers.

Since then, research has cast further doubt on the accuracy of examiner judgements, and more powerful statistical techniques for estimating the relative difficulty of exams have been developed.5,7 The error margins associated with these statistical estimates8 are considerably smaller than with examiner judgements.5 So why bother with the examiners at all?
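One standard family of statistical techniques, equipercentile equating, gives a flavour of how this works: a boundary is carried from one paper to another by matching cumulative percentiles in comparable cohorts. This is offered as an illustrative example of the general approach, not as a description of the specific methods in the papers cited, and the mark distributions below are invented.

```python
# Illustrative equipercentile-style equating on invented mark distributions.

def cumulative_percent(mark, marks):
    """Percentage of candidates scoring at or below this mark."""
    return 100 * sum(1 for m in marks if m <= mark) / len(marks)

def equated_boundary(old_boundary, old_marks, new_marks):
    """Find the lowest mark on the new paper whose cumulative percentile
    matches (or first exceeds) that of the old boundary on the old paper."""
    target = cumulative_percent(old_boundary, old_marks)
    for mark in sorted(set(new_marks)):
        if cumulative_percent(mark, new_marks) >= target:
            return mark
    return max(new_marks)

old = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]  # last year's marks (invented)
new = [25, 30, 34, 39, 44, 48, 53, 58, 62, 68]  # this year's marks (invented)
print(equated_boundary(55, old, new))           # 48: boundary drops on the harder paper
```

The crucial assumption, which the article returns to, is that the two cohorts are of comparable ability, so any shift in the score distribution reflects paper difficulty rather than a genuine change in what candidates know and can do.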

Stats not all, folks

The limitation of statistical models is that they are blind to what candidates actually know, understand and can do. From one year to the next, changes in what candidates as a whole can do are likely to be too small to matter; but, over time, performance standards could rise or fall and we would not know.

What we really want to know is what pupils have learned by the time they complete a GCSE or A-level course. That means we need expert eyes looking at what they produce in their exams.

This is not to say that we ought to continue with current practice. For the vast majority of exams, where statistics are much more accurate and precise than human judgement, current awarding methods are a demonstrably ineffective use of examiners’ time and expertise. But whilst we continue to talk about grades in terms of performance standards, and whilst we measure the success of schools by the proportion of their pupils reaching those standards, there remains a need to find an effective way of identifying those standards, and of detecting any change in them.

Neil Stringer


  1. Foxman, D., et al. (1985). A review of monitoring in mathematics 1978 to 1982. London: Assessment of Performance Unit.
  2. Pollitt, A., et al. (1985). What makes exam questions difficult? Edinburgh: Scottish Academic Press.
  3. Good, F. J., & Cresswell, M. J. (1988). Grading the GCSE. London: Secondary Examinations Council.
  4. Good, F. J., & Cresswell, M. J. (1988). Differentiated assessment: grading and related issues. London: Secondary Examinations Council.
  5. Baird, J.-A., & Dhillon, D. (2005). Qualitative expert judgements on examination standards: valid, but inexact (RPA_05_JB_RP_077). Manchester: AQA Centre for Education Research and Policy.
  6. Cresswell, M. J. (1997). Examining judgements: theory and practice of awarding public examination grades. London: University of London, Institute of Education.
  7. Stringer, N. S. (2012). Setting and maintaining GCSE and GCE grading standards: the case for contextualised cohort-referencing. Research Papers in Education, 27(5), 535-554.
  8. Benton, T., & Lin, Y. (2011). Investigating the relationship between A level results and prior attainment at GCSE. Coventry: Office of Qualifications and Examinations Regulation.
