“Everyone hates summative end-of-year assessments!” Complaints about them are rampant among educators, legislators, parents, students, and even some measurement professionals. The tests come at the wrong time, just when students are changing teachers, grades, or schools. They don’t provide useful information, only proficiency ratings or nearly meaningless sub-scores that are not diagnostic/formative for students. They are stressful for educators and students, as summative scores account for the lion’s share of most states’ school accountability determinations.

These and other concerns with summative assessments have steadily grown in prominence since the passage of the federal No Child Left Behind (NCLB) Act of 2001. And while the current iteration of that law, the Every Student Succeeds Act (ESSA) of 2015, relaxed some of the rigidity of federal accountability requirements, it did little to assuage dissatisfaction with summative tests.

Summative end-of-year assessments are very good at providing broad system-level information, such as describing overall math gains among eighth-graders in a state. However, states have responded to criticisms of summative tests by creating more comprehensive “assessment systems”—often incorporating regular interim assessments, offering choice among assessment products, providing topical or predictive assessments for educators (typically not used in accountability), and allowing or encouraging districts and schools to develop or adopt assessments of their own. Some states allow district-level accountability measures to supplement state- or federal-level accountability. Most continue to use end-of-year summative assessments, but their emphasis within the overall system is being reduced.

Interim Tests

For purposes of this blog, we’ve narrowed the definition of “interim” to include only statewide assessments administered within at least two specific windows of time throughout the academic year. States that administer interims usually offer a fall, winter, and spring test, but there are variations.

There are two main types of interim assessments. The first is a “mini-summative,” which looks just like the end-of-year summative but may be shorter. It uses the same assessment blueprint (sampling from the full set of academic standards) and generates the same kinds of scores as a summative assessment, which may include proficiency ratings, scale scores, and sub-scores.

The second type of interim assigns specific content to each assessment. This type requires that content be taught in specific blocks throughout the year to match the testing schedule. Scores from each interim assessment are more like sub-scores on a summative assessment.

Lessons Learned

HumRRO has conducted evaluative work on both types of interim assessments, and while some issues are common across the two, they each have unique challenges. Here are a few lessons we have learned evaluating interim assessments:

  • Lesson #1: Shifting from a summative assessment to interim assessments does not automatically make the assessment information more diagnostic/formative.

    Most interim assessments are similar in structure to summative assessments, meaning that they do not provide more, or more nuanced, information. This is especially true for shortened versions of summative assessments. These may be predictive of end-of-year performance, but they rarely give educators the detailed information they can use in the classroom to adjust instruction and meet individual students’ learning needs.

    Interims that measure less content may provide more specific information, but because they typically sample from a broad set of content standards, scores are still not diagnostic. These types of interims provide information akin to sub-scores on summative tests, but they may be more relevant because their timing coincides with the topics addressed in the curriculum. Truly diagnostic tests—those that identify specific student learning needs—must either be very narrow in scope or extremely long. Current interim assessments are neither.

  • Lesson #2: Assigning annual proficiency ratings based on interim assessments is more complex than adding up scores.

    “Can’t we just add up the interim scores to get a final score?” is a common, if naïve, question. The answer might be “yes” if each interim assessment measures separate but equally important content. In that case, an overall score might be represented by an average, or each interim score might be considered separately. For example, proficiency could be computed three times rather than once—and reported as such—which may be more useful for educators but may not meet regulations requiring an overall annual proficiency determination.

    If each interim assessment measures the same content, however, averaging the scores makes little sense. If students are tested on the same content at different times throughout the year, we would expect their scores to increase as they receive more instruction. A very low early score may obscure a student’s improvement in a subject, and the best estimate of a student’s proficiency at the end of the year comes from an assessment given at that time. Most states that use these types of interims assign proficiency ratings based only on the final interim assessment, making it function as a summative assessment. Earlier interims are scored and used by educators to signal concerns prior to the spring interim administration.

  • Lesson #3: Including complex item types is at least as difficult on interims as it is on summative assessments.

    A major complaint about large-scale testing is that the scores arrive too late, often after the students have moved on to the next grade or school. Interim tests can provide more timely information during the academic year, but there are constraints on how quickly scores can be reported. Scores need to arrive shortly after test administration to be useful in classrooms, which means that complex item types, some of which must be scored by humans, are impractical. Advances in automated scoring, including artificial intelligence (AI) scoring, can help, but they introduce other issues, such as training the scoring engines and ensuring comparability with human scorers. Immediate reporting requires machine scoring, which often means compromises regarding test format, item types, and item complexity. Interim assessments add to this pressure by increasing the number of score reports produced each academic year, further limiting the room for complex item types.

  • Lesson #4: Assessment literacy is even more important when a state implements a system of assessments.

    The promise of more frequent and useful assessment information for educators must be matched by commensurate improvements in educators’ assessment literacy. The most appropriate uses of summative assessment data are at the system level. Summative assessments can point to trends in student performance at the school, district, or state level. They can inform policy related to school or district funding. They can signal changes in the performance of student groups.

    Teachers complain that summative data are not useful for addressing individual students’ learning needs, and they are largely correct in that assertion. If, however, teachers receive student assessment scores three times or more per year, they may begin to use the data in different ways. Some of these uses may be appropriate, but some may not. Remember from Lesson #1 that interim assessments are not automatically diagnostic, so making instructional decisions for individual students based on those scores may not be appropriate.

    It is vital that educators understand how to use data from every part of an assessment system appropriately. Otherwise, teachers may draw incorrect conclusions about student learning and adapt their instructional strategies or curriculum ineffectively. More data does not automatically mean better practice, and if assessment literacy is poor, it could have the opposite effect.

  • Lesson #5: Evaluations of interims should focus on the intended and required uses of the assessments.

    It is easy to get caught up in summative assessment bashing, and the promise of something, anything, different can be appealing. However, it is important to keep in mind that implementing interim assessments will not address all the complaints about summative assessments.

    States must approach interim assessments thoughtfully, with a clear understanding of how they are best used and how they fit the state’s needs and legislative requirements. A strong theory of action (TOA) can help states communicate how they intend for the interims to be used, and a TOA provides a good basis for an evaluation. Strong evaluations must go beyond basic test quality and investigate the utility of interim assessment data for the stated purposes, including use by schools and teachers.

    The goal of education is to improve the learning outcomes of students, so it is vital to ensure that any educational tool is used as intended before evaluating that tool’s effectiveness.

This is the first in a three-part blog series highlighting HumRRO’s experience evaluating state K-12 assessment systems and exploring some of our early lessons learned. The next two blogs, due to be published later this week, focus on formative/diagnostic assessments and competency-based local assessments.

About the Authors:

Art Thacker, Ph.D. - Chief Scientist

Monica Gribben, Ph.D. - Principal Scientist
