Classification accuracy and consistency in GCSE and A Level examinations offered by the Assessment and Qualifications Alliance (AQA) November 2008 to June 2009

The aim of this study was to investigate classification accuracy and consistency in individual units of GCSE and A level examinations offered by the Assessment and Qualifications Alliance (AQA) from November 2008 to June 2009. As marking reliability has been considered extensively elsewhere, the scope was limited to units composed of objective, short-answer or structured-response test items, for which reliable marking could reasonably be assumed. Two methods were used to derive the estimates: an IRT model and the Livingston and Lewis (1995) procedure. The IRT model is the more stringent of the two, as it assumes that parallel tests are equivalent in difficulty. As expected from this difference in assumptions, and from the wider literature, the indices were lower for the Livingston and Lewis procedure than for the IRT model.
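For readers unfamiliar with the two indices, the sketch below illustrates what they measure, using the standard definitions: classification accuracy is agreement between the grade awarded and the grade implied by the true score, and classification consistency is agreement between the grades awarded on two parallel forms. It is not the IRT or Livingston and Lewis estimation procedure used in the study; both of those estimate the indices from a single administration, whereas here the true marks and the second form are simply assumed to be known. The marks, the grade boundaries and the function names are all invented for illustration.

```python
from bisect import bisect_right


def to_grade(mark, boundaries):
    """Map a mark to a grade index: the number of boundaries at or below it."""
    return bisect_right(boundaries, mark)


def agreement(grades_a, grades_b, tolerance=0):
    """Proportion of candidates whose two grades differ by at most `tolerance`."""
    pairs = list(zip(grades_a, grades_b))
    return sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)


# Invented illustration data, not results from the study.
boundaries = [20, 30, 40, 50]          # hypothetical lowest mark for each grade
true_marks = [18, 25, 31, 39, 44, 52]  # unobservable in practice
form_1 = [21, 24, 29, 41, 45, 50]      # observed marks on one form
form_2 = [17, 27, 33, 38, 47, 49]      # observed marks on a parallel form

true_grades = [to_grade(m, boundaries) for m in true_marks]
grades_1 = [to_grade(m, boundaries) for m in form_1]
grades_2 = [to_grade(m, boundaries) for m in form_2]

accuracy = agreement(true_grades, grades_1)        # observed vs true grade
consistency = agreement(grades_1, grades_2)        # form 1 vs form 2
accuracy_adjacent = agreement(true_grades, grades_1, tolerance=1)

print(accuracy, consistency, accuracy_adjacent)
```

Setting `tolerance=1` counts a candidate as correctly classified when the two grades are the same or immediately adjacent. Note that this is an overall proportion across all candidates, not a figure conditional on a particular awarded grade.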

The results showed that, for the GCE and GCSE units analysed, at least 89 per cent of all candidates with a particular grade (other than the highest or lowest grade) have true scores either in that grade or in an immediately adjacent one. For some units the figure is much higher, up to 100 per cent. There was more variation at GCSE than at GCE. The main reason was that the qualification criteria governing the GCSEs modelled here were less restrictive than those for GCE; as a result, a GCSE could be composed of anything from two to seven units. The length of each test was in proportion to the percentage of marks the unit contributed to the total qualification. Maximum marks therefore ranged more widely for GCSE units than for A level units: the lowest GCSE maximum mark was below the lowest at A level, and the highest was above the highest at A level. The mean grade boundary width, which is directly related to classification consistency and accuracy, accordingly shows greater variation for GCSE than for A level. The GCSE qualification criteria have since been tightened, but still allow some variation in the number of units.
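The link between grade boundary width and classification accuracy can be illustrated with a deliberately simple calculation. The sketch below assumes normally distributed measurement error with a fixed standard error of measurement (SEM); this is an assumption made here for illustration, not the error model used in the study. Under that assumption, it gives the probability that a candidate whose true score lies at the centre of a grade band receives an observed mark inside the same band. The function name and the numerical values are hypothetical.

```python
import math


def prob_observed_in_band(width, sem):
    """P(|error| <= width / 2) for normally distributed error with sd `sem`.

    This is the chance that a candidate whose true score sits at the centre
    of a grade band of the given width is awarded that same grade.
    """
    return math.erf(width / (2.0 * sem * math.sqrt(2.0)))


sem = 3.0  # hypothetical standard error of measurement, in marks
for width in (3, 6, 12):  # hypothetical grade boundary widths, in marks
    print(width, round(prob_observed_in_band(width, sem), 3))
```

With the SEM held fixed, a wider grade band gives a higher probability of correct classification, which is why units with fewer marks, and hence narrower bands, would be expected to show lower indices.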

