Anyone who has ever taken a test when the stakes are high—to get a job, promotion or certification, to graduate from high school, to enter college or the military, or to become a citizen—understands the importance of accurate test scores.
We tend to trust that the test is well designed to meet its intended purpose and that the associated scores reflect the target knowledge, skills, and abilities. Such scores have a tremendous impact on our lives and our children’s lives, so we often assume that the care given to producing them is commensurate with their potential consequences. How well-placed is that trust? Simply put, mistakes can and do happen. How do we minimize the chance that mistakes will occur? In our experience, rigorous quality assurance (QA) at every step of the score generation process is the essential element.
Getting QA Right—It’s Not a Simple Process
High-stakes educational assessments generate a variety of scores, including total or “raw” scores, normative or relative-performance scores, and a host of “derived” scores known as scale scores. From these, additional reporting information, such as level of performance or proficiency and distance to the “passing standard,” is also generated. Even setting aside whether students’ responses are correctly coded, scanned, or rated (e.g., for essay items), generating these scores is extremely complicated.
Generating test scores involves much more than simply summing item-level responses (raw scores). In most cases, a host of rules ensures that item responses are valid and that the summation process is accurate (including attemptedness checks, timing rules, and subsection processing). Once the item responses are deemed valid, an underlying measurement or statistical model is typically used to transform them into derived scores such as percentiles or scale scores. Measurement models may require the estimation of several item parameters (characteristics of the test questions), potentially addressing item difficulty, discrimination (how well the item separates higher- from lower-ability students), and even how readily a student might guess the correct response.
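To make this concrete, here is a minimal sketch of what such validity rules might look like in code. The thresholds and field names are purely illustrative assumptions; every assessment program defines its own attemptedness and timing rules.

```python
from dataclasses import dataclass

# Hypothetical validity rules; real programs define their own thresholds.
MIN_ITEMS_ATTEMPTED = 5      # assumed attemptedness rule
MIN_TOTAL_SECONDS = 120      # assumed timing rule

@dataclass
class StudentResponse:
    item_scores: list[int | None]   # None means the item was not attempted
    total_seconds: int              # time the student spent on the test

def is_valid_attempt(resp: StudentResponse) -> bool:
    """Apply illustrative attemptedness and timing checks."""
    attempted = sum(1 for s in resp.item_scores if s is not None)
    return attempted >= MIN_ITEMS_ATTEMPTED and resp.total_seconds >= MIN_TOTAL_SECONDS

def raw_score(resp: StudentResponse) -> int | None:
    """Sum item-level points only when the attempt is deemed valid."""
    if not is_valid_attempt(resp):
        return None   # flagged for separate handling rather than scored
    return sum(s for s in resp.item_scores if s is not None)
```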
To add even more complexity, items worth more than one point have multiple difficulty parameters, each estimated from the performance of a large sample of students using complex maximum-likelihood statistics. The parameters are then linked to the item and, in conjunction with students’ item responses, are used to estimate performance (typically reported as a derived scale score). There are many more steps before a student receives a score report, but the point we are making is this: generating student scores is a complex endeavor, even for the simplest assessment program reporting only total or raw scores, and most programs are vastly more complicated than that.
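For readers who want a feel for the kind of model involved, the sketch below shows a three-parameter logistic (3PL) item response function, a crude grid-search maximum-likelihood estimate of a student’s ability, and a linear conversion to a reported scale score. The item parameters, scaling constants, and grid are invented for illustration; operational estimation is far more sophisticated than this.

```python
import math

# Illustrative item parameters (a = discrimination, b = difficulty, c = guessing);
# operational values are estimated from large samples of student responses.
ITEMS = [
    {"a": 1.2, "b": -0.5, "c": 0.20},
    {"a": 0.8, "b":  0.3, "c": 0.25},
    {"a": 1.5, "b":  1.1, "c": 0.15},
]

def p_correct(theta: float, item: dict) -> float:
    """3PL probability of a correct response at ability theta."""
    a, b, c = item["a"], item["b"], item["c"]
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def estimate_theta(responses: list[int]) -> float:
    """Crude maximum-likelihood estimate of ability via a grid search."""
    best_theta, best_loglik = 0.0, float("-inf")
    for step in range(-400, 401):            # theta grid from -4.00 to 4.00
        theta = step / 100
        loglik = sum(
            math.log(p_correct(theta, item)) if x == 1
            else math.log(1 - p_correct(theta, item))
            for item, x in zip(ITEMS, responses)
        )
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta

def scale_score(theta: float, slope: float = 50, intercept: float = 500) -> int:
    """Hypothetical linear transformation from theta to a reported scale score."""
    return round(slope * theta + intercept)

# Example: a student who answered the first two items correctly.
print(scale_score(estimate_theta([1, 1, 0])))
```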
Building Trust in Scores Through Decades of QA Experience
In the mid-1990s, HumRRO contracted with a state to implement a research agenda for their newly launched assessment program. Our work centered on investigating the reliability, validity, fairness, utility, and consequences of the new assessment and the associated accountability system. To be clear, we were not hired to conduct QA, though that quickly became a key focus once a scoring error was discovered (and the testing vendor’s contract was terminated). At that point, the state asked HumRRO to complete operational psychometric processing to generate students’ scores. We did so, but all the while the dangerous and very real possibility of making a processing error loomed large in our minds.
To navigate this challenge, we relied on our experience with high-stakes, high-visibility tests like the Armed Services Vocational Aptitude Battery (ASVAB) and professional certification tests to guide our thinking. We first adopted the bedrock practice of having everything related to processing done twice, by teams working entirely independently, then comparing and resolving any differences that emerged during processing. We knew that anyone could—and some eventually would—make some sort of processing error or mistake. We reasoned that the chance of two completely independent teams making exactly the same error at exactly the same point in processing was vanishingly small.
Even so, our relief was palpable when the state contracted with a new testing vendor, making us no longer solely responsible for psychometric processing. Importantly, the state decided instead to institute a formal, third-party independent replication step for psychometric processing in the late 1990s. HumRRO has conducted this independent replication for the state ever since. Perhaps counterintuitively, it has been extremely rare during that time for our results and the testing vendor’s results to match perfectly after initial data processing. Often there is some processing rule that must be clarified, or an issue arises that each team resolves differently, leading to small differences in student scores. This is often a learning process, because these issues are brought to the state for consideration and resolution, improving the transparency of the process and giving the state more control over decisions that affect its students and educators. Other times, discrepancies stem from simple human error or from different individuals interpreting the processing steps or rules differently. But the critical bottom line is this: such discrepancies are identified and mitigated before they can negatively impact students’ lives.
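The comparison step at the heart of independent replication can be illustrated simply. The sketch below assumes each team produces a file of student IDs and scale scores; the file layout and column names are hypothetical.

```python
import csv

def load_scores(path: str) -> dict[str, str]:
    """Read a hypothetical results file with 'student_id' and 'scale_score' columns."""
    with open(path, newline="") as f:
        return {row["student_id"]: row["scale_score"] for row in csv.DictReader(f)}

def compare_runs(path_a: str, path_b: str) -> list[str]:
    """Flag every student whose two independently produced scores disagree."""
    a, b = load_scores(path_a), load_scores(path_b)
    issues = []
    for student_id in sorted(set(a) | set(b)):
        if student_id not in a or student_id not in b:
            issues.append(f"{student_id}: present in only one team's output")
        elif a[student_id] != b[student_id]:
            issues.append(f"{student_id}: {a[student_id]} vs {b[student_id]}")
    return issues

# Every flagged record is investigated and resolved before scores are reported.
for issue in compare_runs("team_a_scores.csv", "team_b_scores.csv"):
    print(issue)
```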
HumRRO’s QA Work Today
HumRRO’s role as the go-to QA partner for states has grown considerably. While it would be gratifying to attribute that growth entirely to the quality of our work, it was also facilitated by states discovering numerous psychometric processing errors as they were forced to scale up their assessment programs to comply with federal legislation, beginning with the No Child Left Behind Act and continuing under the Every Student Succeeds Act. Psychometric staff were stretched thin as states doubled or tripled the number of tests they gave each year while simultaneously increasing their complexity. It may not be surprising that errors ensued. Some of these errors were well publicized while others were quietly corrected, but states had no choice but to recognize the value of rigorous QA for psychometric processing. We were fortunate to have processes in place to help states ensure that their students’ scores were correct.
Today, HumRRO replicates psychometric processing for several state education agencies, either under direct contract or as a sub-contractor to a testing contractor. We are proud that we can replicate, in real time, the psychometric processing of nearly every major testing company in the United States. Because we are an independent, nonprofit organization, we do so without an agenda biased toward any particular outcome. QA must be done in real time, objectively and accurately. Replication that occurs after operational processing can only discover errors already made. Replication in real time can ensure that processing mistakes never get the chance to become errors on score reports.
Extending QA to Accountability Systems
We are also proud of our work with states in providing quality assurance for their accountability systems. These systems use students’ test scores and other information to generate school- and district-level data and accountability classifications, which are then used to aid and support the schools that are struggling the most. These systems often rely on traditional student scores and classifications, growth indices, graduation rates, absenteeism, college/career readiness indicators, student group performance, English language development among non-native speakers, alternate assessment scores for students with the most severe cognitive disabilities, and other state-specific indicators. States vary greatly in their approach to accountability, and generating a school-level classification is always very complex. A mistake at this level could lead to schools not getting the help they need, or to undeserved scrutiny for better-performing schools.
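As a deliberately simplified illustration of how such a system might combine indicators into a classification, consider the sketch below. The indicators, weights, and cut points are entirely hypothetical; real accountability systems add many more business rules (minimum group sizes, indicator-specific handling, and so on).

```python
# Hypothetical indicator weights and classification cut points; every state
# defines its own indicators, weights, business rules, and cut scores.
WEIGHTS = {
    "achievement": 0.40,
    "growth": 0.30,
    "graduation_rate": 0.20,
    "chronic_absenteeism": 0.10,   # already expressed so that higher = better
}
CUTS = [(75, "Exceeds Expectations"), (50, "Meets Expectations"),
        (25, "Approaching Expectations"), (0, "Needs Support")]

def composite_index(indicators: dict[str, float]) -> float:
    """Weighted sum of 0-100 indicator values for a school."""
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)

def classify(indicators: dict[str, float]) -> str:
    """Map the composite index to an accountability classification."""
    index = composite_index(indicators)
    for cut, label in CUTS:
        if index >= cut:
            return label
    return CUTS[-1][1]

# Example school with hypothetical indicator values.
school = {"achievement": 62.0, "growth": 71.5,
          "graduation_rate": 88.0, "chronic_absenteeism": 80.0}
print(classify(school))   # -> "Meets Expectations" (index of roughly 71.85)
```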
Looking Toward the Future
Educational assessment has recently come under fire for its perceived role in perpetuating cultural stereotypes and maintaining opportunity gaps for traditionally under-represented groups. In response, many colleges have changed their admissions policies to be “test optional” (though this trend may have reached its zenith). Educational assessments have also been scrutinized for over-promising the extent to which scores may be interpreted and used. States must test students in reading and math in grades 3-8 and high school according to the federal Every Student Succeeds Act, and while most of those tests are end-of-year summative tests, they are often “shoe-horned” into service to inform instruction as well. These and other similar criticisms erode the public’s confidence in otherwise highly effective and useful educational tools. These concerns could be compounded if more attention were given to the idea that the scores could simply be wrong.
Assessment plays a vital role in effective instruction and learning. Test scores are used for myriad purposes, and most of those purposes have important consequences for someone: Educators, parents, and students all trust the information they are given to make vital educational decisions. Our QA processes should be rigorous enough to merit that trust. Independent replication is one of the strongest methods of QA for psychometric processing and should be standard practice for any assessment program.