Educators, like most professionals, are given more responsibility as their expertise and experience grow. Most school principals, content specialists, and district-level personnel have been classroom teachers. Because the results of summative end-of-year assessments did not meet their needs when they were teachers, they are understandably skeptical that those tests will meet their needs as educational leaders. Consequently, they look for better options and often adopt commercially available diagnostic/growth assessment systems.

Where Part 1 of this blog series focused on statewide interim assessments, we now turn to commercially available diagnostic/growth assessments. These are usually administered by individual schools or districts rather than statewide, and their results are therefore not typically used in states’ school accountability determinations. Like statewide interim assessments, diagnostic assessments have seen growing use in recent years as a supplement to summative end-of-year assessments. The hope is that they will provide more individualized feedback for students and teachers.

Many commercial systems are available to educators, but they vary widely in scope and in the information and support they provide. The common thread is the claim that, compared to end-of-year summative assessments, they give educators different and potentially more useful information to use in their efforts to teach relevant content to students. This assertion should be vigorously investigated and evaluated, both to ensure that the cost of these programs is justified and to verify that educators and their students benefit from participation. But because these programs are not typically given the same attention as the statewide assessments used in federal accountability, those investigations rarely receive the same scrutiny. Several vendors, to their credit, seek external validation and maintain rigorous research agendas for their assessment systems, but these efforts are not subjected to the same peer review as statewide assessments.

Lessons Learned

HumRRO has conducted evaluations for several of these programs, each presenting its own challenges. There are, however, some common lessons we have learned while evaluating diagnostic/growth assessments that are worth pointing out.

  • Lesson #1: Sampling can be challenging.

    Recruiting schools and districts to participate in an evaluation can be especially challenging. Unlike statewide initiatives, where the state education agency can recruit or assign participants, participation in these programs requires that schools/districts purchase program components.

    Evaluations of the programs can “feel” like evaluations of teachers, schools, or districts, and many participants are unwilling to subject themselves to such scrutiny, despite assurances that they are not being evaluated as such. The companies behind these programs can also be very cautious with their clients. To prosper, they must maintain positive relationships with schools and districts, so any request that may dampen their clients’ enthusiasm must be weighed against the potential long-term benefits to the programs. It is therefore important for evaluators to specify the parameters of the evaluation up front, including the magnitude of the impacts they will be looking for and a statistical power analysis that defines the sample needed to detect those impacts reliably (a brief sketch of such a calculation follows this list of lessons). The evaluation should cause as little disruption to students and educators as possible, while still rigorously addressing the research questions.

  • Lesson #2: Variability in implementation must be addressed.

    Commercial assessment programs are often modular, meaning schools and districts may adopt individual system components. These components may include interim assessments, curricular products, professional development, teaching tools, or other products. Evaluating the effectiveness of these programs often requires isolating specific components or limiting the evaluation to the sample using the full system (all of the components). This further complicates sampling and can also affect any “fidelity of implementation” metrics.

    It is rare for a school/district to adopt an entire system all at once, so most have experience with some components before they add others. Even if two schools use the same system components, they may have differing levels of experience and expertise with any given component. Even if an evaluation can limit the sample to “full implementers,” it may be difficult to find schools/districts with a common “path” to full implementation. Evaluations of these systems must account for variability in implementation, or the findings may be suspect and are likely to underestimate the impact of the system.

  • Lesson #3: Customization of commercial assessments introduces additional challenges.

    One of the most appealing features of a commercial assessment system is that it provides information that is interpretable across constituencies (e.g., states, districts). Results can be compared directly, and educators can understand student performance in a larger context than a single classroom or school. Commercial vendors gain economies of scale by standardizing aspects of the system where it makes sense to do so, and they are resistant to change because there is value in being able to track student performance trends across clients.

    However, to comply with state-level policies or preferences, commercial vendors will sometimes customize their assessments and other system components. A common example is the addition or removal of tested content to align with state academic standards, meaning that whether certain topics are tested depends on the state in which a student resides. While most assessment systems are robust and minor inclusions/exclusions do not substantively affect overall scores, they may affect the diagnostic or instructional utility of those scores. Evaluators must be cognizant of differences in the way a system is implemented across the sample studied and must address those differences. If a sample is limited to a single state, it is important to note the limitations on generalizing the results beyond that state.

    Evaluating alignment with state content standards is often part of the evaluation of commercial assessment programs. It is important for the evaluator to consider the stated goals of the system and the goals of the state before conducting an alignment study. Too often, alignment studies focus on the number or proportion of items on an assessment or in an item pool that reflect academic content standards, without regard to the intended inferences made from the test scores. An assessment could have 100% of its items aligned to a state’s academic standards but directly test only a small proportion of those standards (a small illustration of this distinction follows this list of lessons). Conversely, a test could assess most of a state’s content standards, but the items could reflect a different level of understanding than the content intends (e.g., demanding only recall of information rather than application of content in unfamiliar contexts). The first step in any alignment study should be a comparison of the intended inferences supported by the commercial assessment system with the inferences the state intends to make.

  • Lesson #4: Commercial assessment vendors want more than an evaluation of their assessments.

    Summative end-of-year assessments are typically evaluated based on their content, reliability, or other “measurement-focused” metrics. Rarely will an evaluator address how educators use score reports or whether educators endorse the information for a given purpose. The utility and effectiveness of the reports from those assessments for improving student performance are almost never part of an evaluation.

    Commercial assessment systems are often promoted as more effective alternatives to summative end-of-year tests. Vendors may claim that the assessment data they produce are more specific, timely, and useful for educators. If that is true, one would expect educators who use these systems to perform better and, consequently, their students to learn more than peers who do not receive those benefits. Evaluators may therefore be tasked with much more than evaluating the function of the assessments. They may be asked to evaluate educators’ experiences with the assessments and associated tools, or to compare student outcomes. These types of evaluations extend well beyond the typical requirements of federal peer review and may not be as familiar to state education agency personnel, but they are vital for determining the impact of educational systems on student outcomes.
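
To make the power-analysis point in Lesson #1 concrete, here is a minimal sketch of the kind of calculation an evaluator might run before recruiting a sample. It assumes a simple two-group (program vs. comparison) design analyzed with an independent-samples t-test; the effect size, alpha, and power values are illustrative placeholders rather than figures from any actual HumRRO evaluation, and a real study would also need to account for the clustering of students within classrooms and schools.

```python
# A minimal power-analysis sketch (illustrative values only).
# Assumes a simple two-group comparison; real evaluations of
# diagnostic/growth systems typically need multilevel adjustments
# for students nested within classrooms and schools.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Hypothetical targets: detect a small standardized effect (d = 0.25)
# with 80% power at the conventional alpha of .05.
n_per_group = analysis.solve_power(
    effect_size=0.25,        # smallest impact worth detecting (Cohen's d)
    alpha=0.05,              # Type I error rate
    power=0.80,              # chance of detecting the effect if it exists
    alternative="two-sided",
)

print(f"Units needed per group: {n_per_group:.0f}")
# -> roughly 250 per group under these assumptions
```

Even a rough estimate like this is a useful reality check when weighing recruitment burden against the impacts an evaluation is expected to detect.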

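Lesson #3’s distinction between item alignment and standards coverage can likewise be illustrated with a small, entirely hypothetical example. The standards and item-to-standard mappings below are invented for illustration; the point is only that “percent of items aligned” and “percent of standards covered” answer different questions, and neither speaks to the cognitive demand of the items or to the inferences the scores are meant to support.

```python
# Hypothetical alignment data: every item maps to a state standard,
# but the items cluster on only a few of those standards.
state_standards = {f"STD-{i}" for i in range(1, 21)}  # 20 standards

item_to_standard = {                                   # 10 test items
    "item01": "STD-1", "item02": "STD-1", "item03": "STD-2",
    "item04": "STD-2", "item05": "STD-2", "item06": "STD-3",
    "item07": "STD-3", "item08": "STD-4", "item09": "STD-4",
    "item10": "STD-5",
}

aligned = [std for std in item_to_standard.values() if std in state_standards]
covered = set(aligned)

pct_items_aligned = 100 * len(aligned) / len(item_to_standard)
pct_standards_covered = 100 * len(covered) / len(state_standards)

print(f"Items aligned to a standard: {pct_items_aligned:.0f}%")      # 100%
print(f"Standards covered by items:  {pct_standards_covered:.0f}%")  # 25%
```

Both figures are legitimate alignment statistics, yet they support very different conclusions, which is why agreeing on the intended inferences should come first.
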
Caroline Wiley, Ph.D., Principal Scientist at HumRRO, co-authored this blog.

This is the second in a three-part blog series highlighting HumRRO’s experience evaluating state K-12 assessment systems and exploring some of our early lessons learned. The first installment focused on interim assessments. The third blog, to be published tomorrow, will focus on competency-based local assessments.

Art Thacker - Chief Scientist

For more information, contact:

Art Thacker, Ph.D.

Chief Scientist