Inter-subject comparability of examination standards in GCSE and GCE in England

  1. Alton, A., & Pearson, S. (1996). Statistical approaches to inter-subject comparability (Unpublished UCLES research paper).
  2. Andrich, D. (1978). A binomial latent trait model for the study of Likert-style attitude questionnaires. British Journal of Mathematical and Statistical Psychology, 31, 84–98.
  3. Andrich, D. (2015). The problem with the step metaphor for polytomous models for ordinal assessments. Educational Measurement: Issues and Practice, 34, 8–14.
  4. Baird, J., Cresswell, M., & Newton, P. (2000). Would the real gold standard please step forward? Research Papers in Education, 15, 213–229.
  5. Bramley, T. (2011). Subject difficulty—the analogy with question difficulty. Research Matters: A Cambridge Assessment publication, Special Issue 2: Comparability, 27–33.
  6. Bramley, T. (2016). The effect of subject choice on the apparent relative difficulty of different subjects (Cambridge Assessment Research Report). Cambridge: Cambridge Assessment.
  7. Coe, R. (2008). Comparability of GCSE examinations in different subjects: An application of the Rasch model. Oxford Review of Education, 34, 609–636.
  8. Coe, R., Searle, J., Barmby, P., Jones, K., & Higgins, S. (2008). Relative difficulty of examinations in different subjects (Report for SCORE—Science Community Supporting Education). CEM Centre: Durham University. Retrieved from
  9. Elliott, G. (2013). A guide to comparability terminology and methods. Cambridge: Cambridge Assessment. Retrieved from
  10. He, Q., Anwyll, S., Glanville, M., & Opposs, D. (2014). An investigation of measurement invariance of Key Stage 2 National Curriculum science sampling test in England. Research Papers in Education, 29, 211–239.
  11. Kolen, M., & Brennan, R. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). New York: Springer.
  12. Korobko, O., Glas, C., Bosker, R., & Luyten, J. (2008). Comparing the difficulty of examination subjects with item response theory. Journal of Educational Measurement, 45, 139–157.
  13. Lamprianou, I. (2009). Comparability of examination standards between subjects: An international perspective. Oxford Review of Education, 35, 205–226.
  14. Linacre, J. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16, 878.
  15. Linacre, J. (2015). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR:
  16. Lockyer, C., & Newton, P. (2015). Inter-subject comparability: A review of the technical literature. Coventry: Ofqual.
  17. Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
  18. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
  19. Newton, P. (2012). Making sense of decades of debate on inter-subject comparability in England. Assessment in Education, 19, 251–273.
  20. Newton, P. (2015). Exploring implications of policy options concerning inter-subject comparability (ISC Working Paper 6). Coventry: Ofqual.
  21. Newton, P., Baird, J., Goldstein, H., Patrick, H., & Tymms, P. (Eds.). (2007). Techniques for monitoring the comparability of examination standards. London: Qualifications and Curriculum Authority.
  22. Newton, P. E., He, Q., & Black, B. (2017). Progression from GCSE to A level: Comparative Progression Analysis as a new approach to investigating inter-subject comparability. Coventry: Ofqual.
  23. Ofqual. (2014a). GCSE (9 to 1) qualification level conditions and requirements. Coventry: Ofqual. Retrieved from
  24. Ofqual. (2014b). Setting GCSE, AS and A level grade standards in summer 2014 and 2015. Coventry: Ofqual. Retrieved from
  25. Ofqual. (2015). GCE qualification level conditions and requirements. Coventry: Ofqual. Retrieved from
  26. Ofqual. (2016). A policy position for Ofqual on inter-subject comparability. Coventry: Ofqual.
  27. Opposs, D. (2015). Inter-subject comparability: International review. Coventry: Ofqual.
  28. Pae, H. (2012). A psychometric measurement model for adult English language learners: Pearson test of English academic. Educational Research and Evaluation, 18, 211–229.
  29. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut. (Expanded edition, 1980. Chicago: University of Chicago Press.)
  30. Reckase, M. (2009). Multidimensional item response theory. New York: Springer-Verlag.
  31. Reeve, B., & Fayers, P. (2005). Applying item response theory modelling for evaluating questionnaire item and scale properties. In P. Fayers & R. Hays (Eds.), Assessing quality of life in clinical trials: Methods and practice (pp. 55–73). Oxford: Oxford University Press.
  32. Smith, E. (2002). Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3, 205–231.
  33. Tan, J., & Yates, S. (2007). A Rasch analysis of the Academic Self-Concept Questionnaire. International Education Journal, 8, 470–484.
  34. Tendeiro, J., & Meijer, R. (2015). How serious is IRT misfit for practical decision-making? (Law School Admission Council Research Report, RR 15-04). Retrieved from
  35. Wong, H., McGrath, C., & King, N. (2011). Rasch validation of the Early Childhood Oral Health Impact Scale. Community Dentistry and Oral Epidemiology, 39, 449–457.
  36. Wright, B., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
  37. Wu, M., & Adams, R. (2007). Applying the Rasch model to psycho-social measurement: A practical approach. Melbourne: Educational Measurement Solutions.
  38. Yen, W. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
  39. Zhao, Y., & Hambleton, R. (2017). Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data. Frontiers in Psychology, 8, 484.