The ticking clock of assessment

Can we accurately measure how standards of achievement change over time?

Deep in a mountain in West Texas, a clock is being built that will tick for 10,000 years. The clock's mechanical pendulum will be so accurate that it will never be more than 800 milliseconds out per day, about one tick per year. The readings from the pendulum will then be adjusted on the basis of measurements of the position of the Sun and the speed of the Earth's rotation [1]. Built to withstand volcanic eruptions, or a century of overcast skies blocking the clock's solar readings, the timepiece is designed to stay accurate to within five minutes over its entire 10,000-year lifetime.

The 10,000-year Clock is an example of measurement at its most precise. Its designers have created an ideal world for a ceaseless clock: inside a mountain, in a high, dry desert, with temperatures that stay even through the day and across the seasons.

Carved into the mountain are five room-sized anniversary chambers, to celebrate the clock's 1-year, 10-year, 100-year, 1,000-year and 10,000-year anniversaries. Until each anniversary arrives, no one will be allowed to enter its chamber to bear witness to the success of the clock.

Marking time

How different the ceaseless clock is from educational measurement!

In assessment, we sample questions from a domain of interest. To determine whether candidates can interpret a range of graphs and diagrams, we offer them a stem-and-leaf diagram, in a particular context, with particular numbers attached. We then decide what counts as a good or a bad answer on the basis of that source material. Then we weigh this answer against all of the other answers in the test, across multiple curriculum areas criss-crossed by skills areas, and compare this information with past examination series to try to determine a standard.

At the moment, the weighing is done by subject experts, but it could be carried out statistically if we were to pre-test more of the questions, or repeat the questions over time. Presumably this will be the approach taken by the national sampling test proposed by the Secretary of State. However, while the methods differ, the problems remain the same. What if we had chosen different questions? Used different numbers? Different markers? Different weightings for questions?

In England, we have tended to design our assessments with the domain we are testing uppermost in our minds. Our assessments must measure wide domains of ability across a variety of modes of response. In the US, measurement has been far more restricted because, their statisticians will tell you, it is hard to estimate how questions selected from different parts of a domain, and measured using different modes of assessment, relate to each other.

When we then ask different candidates different questions in a different assessment, we have to worry about how the questions within each of those assessments relate to each other, and about how the two assessments relate to each other. This makes it staggeringly difficult to derive precise measurements, whether they are based on statistics or expert judgement.
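To see just how slippery this linking problem is, here is a minimal sketch in Python. Everything in it is assumed for illustration only: the cohort sizes, the eight shared anchor questions, the Rasch-like response model, and the crude decision to judge the cohorts on the anchor alone. It is not a description of how any awarding body actually maintains standards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical set-up, for illustration only: two cohorts of equal ability sit
# different 40-question tests that share 8 anchor questions; the second test's
# unique questions happen to be harder.
n_candidates, n_items, n_anchor = 1000, 40, 8

def administer(difficulties, abilities):
    """Right/wrong responses under a simple logistic (Rasch-like) model."""
    p = 1 / (1 + np.exp(-(abilities[:, None] - difficulties[None, :])))
    return rng.random(p.shape) < p

abilities_y1 = rng.normal(0.0, 1.0, n_candidates)
abilities_y2 = rng.normal(0.0, 1.0, n_candidates)   # no real change in standard

anchor = rng.normal(0.0, 1.0, n_anchor)             # questions reused in both years
test_y1 = np.concatenate([anchor, rng.normal(0.0, 1.0, n_items - n_anchor)])
test_y2 = np.concatenate([anchor, rng.normal(0.4, 1.0, n_items - n_anchor)])  # harder paper

resp_y1 = administer(test_y1, abilities_y1)
resp_y2 = administer(test_y2, abilities_y2)

raw_change = (resp_y2.mean() - resp_y1.mean()) * 100
anchor_change = (resp_y2[:, :n_anchor].mean() - resp_y1[:, :n_anchor].mean()) * 100

print(f"Change in overall percent-correct:      {raw_change:+.1f} points")
print(f"Change on the shared anchor questions:  {anchor_change:+.1f} points")
# The raw totals suggest the second cohort is weaker; the anchor questions say
# the cohorts are comparable and that the gap is mostly the harder paper -- but
# with only 8 shared questions the anchor estimate itself carries linking error.
```

Even in this toy world, where we know the truth, the anchor-based comparison wobbles from run to run; in the real world we never know the truth, and the wobble has to be absorbed by statistics or expert judgement.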

A matter of scale

If what we are intending to measure is very large, and we can cope with a large degree of inaccuracy, then imprecise measurement instruments are not problematic. We can pace out a cricket strip if we are playing cricket in a park; at Lord's, this would not do. Governments would like us to measure the success of their education policies, and again it would seem that pacing them out won't really do.

If government education policies were tremendously successful, success or failure would be relatively easy to determine, as the signal-to-noise ratio would be very high. Unfortunately, evidence from the US suggests that the impact of educational interventions, when evaluated using tests with relatively broad subject coverage, is small: around 0.08 of a standard deviation [2]. In other words, very little difference in high-stakes achievement is likely to be seen even after a successful educational intervention. Statisticians in the Netherlands have concluded that it is simply not possible to measure change in achievement from one year to the next with the required degree of accuracy [3].
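To put that 0.08 of a standard deviation in context, a rough back-of-the-envelope simulation may help. The cohort size and, in particular, the 0.05 SD of year-to-year equating error are assumptions made purely for the sake of the sketch, not estimates for any real qualification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative figures only: a cohort of 600,000 candidates, a true improvement
# of 0.08 standard deviations, and an assumed equating wobble of 0.05 SD
# between one series and the next.
n = 600_000
true_effect = 0.08
sampling_se = 1 / np.sqrt(n)    # SE of one cohort mean, in SD units
linking_sd = 0.05               # assumed equating error per comparison

observed = (true_effect
            + rng.normal(0, linking_sd, 10)                  # equating noise
            + rng.normal(0, sampling_se * np.sqrt(2), 10))   # sampling noise, two cohorts

print(f"Sampling error of a cohort mean: {sampling_se:.4f} SD (negligible at this scale)")
print("Ten simulated year-on-year changes around a true gain of +0.08 SD:")
print(np.round(observed, 3))
# With equating error of this order, a genuine 0.08 SD improvement is hard to
# tell apart from ordinary fluctuation, and can occasionally look like a decline.
```

The point is not the particular numbers but the shape of the problem: with whole cohorts, sampling error is tiny, so it is the error in linking one year's assessment to the next that swamps the signal.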

Changing times

While performance in any domain across an entire educational system is unlikely to change greatly from one year to the next, over a longer period of time there may be cumulative change that is large enough to detect. International tests such as PISA and TIMSS benefit from measuring change every three or four years rather than every year. Over a period of several years, long-term trends can be isolated from noise.
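A crude sketch of why the longer window helps, again under assumed figures (a steady true gain of 0.02 SD per year and independent linking error of 0.05 SD on any single comparison): the trend accumulates with the number of years, while the noise on a start-to-end comparison does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed figures, for illustration only.
annual_gain = 0.02   # true improvement per year, in SD units
linking_sd = 0.05    # linking error on a single start-vs-end comparison
n_sims = 100_000

def share_positive(years):
    """Proportion of simulated comparisons in which the observed change is positive."""
    observed = annual_gain * years + rng.normal(0, linking_sd, n_sims)
    return (observed > 0).mean()

for years in (1, 4, 10):
    print(f"Over {years:>2} year(s): observed change is positive "
          f"{share_positive(years):.0%} of the time")
# Over one year the true gain is swamped by linking error; over a decade the
# accumulated trend stands clearly above it.
```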

Of course, the major issue with measuring change in every assessment system is that the measures need to keep changing. Every decade brings with it the educational equivalent of a volcano, and yet we expect our clocks to keep ticking. We could hole up our assessment system deep within a mountain in Texas and open the door only when a decade has passed. We might find, however, that in the meantime, everyone else has switched to lunar standard time.

Chris Wheadon

References

  1. Hillis, D., Seaman, R., Allen, S., & Giorgini, J. (2011). Time in the 10,000-Year Clock. arXiv preprint arXiv:1112.3004. Retrieved from http://arxiv.org/ftp/arxiv/papers/1112/1112.3004.pdf
  2. Hout, M., & Elliott, S. W. (2011). Incentives and Test-Based Accountability in Education. Washington, D.C.: The National Academies Press. Retrieved from http://www.nap.edu/catalog.php?record_id=12521
  3. Béguin, A. (2012). Use of different sources of information in maintaining standards: examples from the Netherlands. In T. J. H. M. Eggen, & B. P. Veldkamp (Eds.), Psychometrics in Practice at RCEC (pp. 27–38). RCEC. Retrieved from http://doc.utwente.nl/80199/
