Making tests comparable

Anton Béguin points to three ways of making sure tests are comparable over time and urges that they be used in combination where possible.

A fair comparison between test-takers who undertake different versions of a test is often an important aspect of an examination system. For example, if an institute of higher education wants to use grades from examinations in secondary education for the selection of students, the grades for different years need to be comparable. Two types of procedure facilitate the comparability of test results: procedures with a focus on content standards, and procedures with a focus on performance standards.

Content standards are maintained during test construction, using a test matrix or test blueprint. Performance standards are maintained using one or more of three types of information. The first is expert judgement of how difficult the questions are. The second is population performance levels, either under the assumption of randomly equivalent groups or corrected for background variables such as prior performance on a test administered earlier in the students' educational careers. The third is linking information based on the administration of items that are common to multiple versions of the test.

In large-scale tests in the UK, it is common to use both expert evaluation of difficulty and population evidence in level setting. Both serve as sources of information to maintain standards and to set comparable cut-scores between levels or grades. The use of linking information is rarer, but it could be a valuable addition to the other sources.

Of these three sources of information, the performance of the population is the most stable. Here, variance between experts does not play a role and sampling error can often be neglected. As long as the performance of the population (either with or without correction for background variables) is stable over some years, an assumption of randomly equivalent groups can be used. In situations where this assumption is not plausible, additional statistical information can be collected in a separate research exercise using the linking approach mentioned above. Items for a new or future test are commonly administered together with items from an older form of the test. In this way, the difficulty of items in the new test can be estimated and compared to that of the items in the previous version.
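The linking step described here can be sketched with a simple mean-sigma transformation, one common way to place item difficulties estimated in two administrations onto a single scale. The function name and all difficulty values below are illustrative assumptions, not taken from any real test:

```python
import statistics

def mean_sigma_link(old_common, new_common):
    """Return slope A and intercept B that place difficulties estimated in the
    new administration onto the old test's scale, using the common items.
    (Illustrative sketch; operational equating uses more elaborate models.)"""
    A = statistics.pstdev(old_common) / statistics.pstdev(new_common)
    B = statistics.mean(old_common) - A * statistics.mean(new_common)
    return A, B

# Invented difficulty estimates for five common items in each administration
old = [-1.2, -0.4, 0.1, 0.8, 1.5]
new = [-1.0, -0.1, 0.3, 1.1, 1.7]

A, B = mean_sigma_link(old, new)
# An item estimated at difficulty 0.5 in the new administration
# maps to A * 0.5 + B on the old test's scale.
rescaled = A * 0.5 + B
```

Once A and B are known, every item in the new test can be rescaled the same way, so the difficulty of the new version as a whole can be compared with the old one.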

Several strategies are possible for the collection of linking information, but a crucial element is the conditions under which the test is administered. Students in a high-stakes condition will often perform better than in conditions with less severe consequences for those taking the test. Ignoring this effect could lead to incorrect evaluations and biased level setting.

The most commonly used linking design is the anchor test design. This design uses a relatively small dedicated test to compare the performance of populations over years. This so-called anchor is administered together with the live test to a sample of students. Since this anchor is used in multiple years, relative performance can be compared between years. A second design uses an anchor as an integral part of the live test. For a sample of students, some items are removed from the live test and replaced by items from the anchor test. This design is not affected by condition effects, since all the items are administered in a high-stakes condition. In a third type of design, data is collected in a test administration which is separate from the live administration. For example, items from two versions of a test can be combined in a pre-test design in which items are administered to evaluate their performance. A variation of this is the combined administration of a dedicated anchor test and some items from future test versions.
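As a minimal illustration of the external anchor design, relative cohort performance can be summarised by comparing scores on the common anchor across years. The scores below are invented for illustration; in practice the comparison would be model-based and would account for sampling and condition effects:

```python
def cohort_shift(anchor_year1, anchor_year2):
    """Mean difference on the common anchor test: a positive value suggests the
    year-2 sample performed better, i.e. a more able cohort rather than an
    easier live test. (Illustrative sketch with assumed data.)"""
    m1 = sum(anchor_year1) / len(anchor_year1)
    m2 = sum(anchor_year2) / len(anchor_year2)
    return m2 - m1

# Invented anchor scores for samples from two successive cohorts
year1 = [12, 15, 14, 10, 13, 16, 11]
year2 = [14, 16, 15, 12, 13, 17, 13]

shift = cohort_shift(year1, year2)
```

Because the anchor itself is unchanged between years, any such shift can be attributed to the cohorts rather than to the tests, which is exactly the information needed when setting comparable cut-scores.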

In summary, a variety of options is available for collecting additional linking information about trends in performance. Such linking information is a valuable complement to expert evaluation and to the comparison of population performance.

Anton Béguin is director of research at Cito, a testing and assessment company based in the Netherlands, and a member of the CERP Advisory Group.

Anton will be presenting at a private AQA seminar being held today for senior policymakers, examining different approaches to maintaining standards over time.
