What Does Educational Testing Really Tell Us? An Interview with Daniel Koretz
EW: What are the three most common misconceptions about educational testing that Measuring Up hopes to debunk?
DK: There are so many that it is hard to choose, but given the importance of NCLB and other test-based accountability systems, I'd choose these:
* That test scores alone are sufficient to evaluate a teacher, a school, or an educational program.
* That you can trust the often very large gains in scores we are seeing on tests used to hold students accountable.
* That alignment is a cure-all - that more alignment is always better, and that alignment is enough to take care of problems like inflated scores.
EW: I'm intrigued by your third point about alignment. For example, we often hear that because state testing systems are directed towards a particular set of standards, we should primarily be concerned with student outcomes on tests aligned with those standards. This is the common refrain about a "test worth teaching to." What's missing from this argument?
DK: Up to a point, alignment is a clearly good thing: we want clarity about goals, and we want both instruction and assessment to focus on the goals deemed most important.
However, there are two flies in the ointment. The first is that the achievement tests we are concerned with, no matter how well aligned, are small samples from large domains of performance. That means that most of the domain, including much of the content and skills relevant to the standards, is necessarily omitted from the test. As I explain in Measuring Up, this is analogous to a political poll or any other survey, and it is not a big problem under low-stakes conditions. Under high-stakes conditions, however, there is a strong incentive to focus on the sampled content at the expense of the omitted material, which causes score inflation. Aligned tests are not exempt. Score inflation does not require that the test include poorly aligned content. Even if the test is right on target, inflation will occur if the accountability program leads people to deemphasize other material that is also important for the conclusions based on scores. And to make this concrete: some of the most serious examples of score inflation in the research literature were found in Kentucky's KIRIS system, which was a standards-based testing program.
The second problem is predictability. To prepare students in a way that inflates scores, you have to know something about the test that is coming this year, not just the ones you have seen in the past. The content, format, style, or scoring of the test has to be somewhat predictable. And, of course, it usually is, as anyone who has looked at tests and test preparation materials should know. Carried too far, alignment actually makes this problem worse, by focusing attention on the particular way that knowledge and skills are presented in a given set of standards. Think about 'power standards,' 'eligible standards,' and 'grade level expectations,' all of which can be labels for narrowing in on the specifics of how a set of skills appears on one state's particular assessment.
Why is this bad? Because many of those specifics are not relevant to the students' broader competence and long-term well-being. Scores on a test are a means to an end, not properly an end in themselves. Education should provide students knowledge and skills that they can use in later study and in the real world. Employers and university faculty will not do students the favor of recasting problems to align with the details of the state tests with which they are familiar. As Audrey Qualls said some years ago: real gains in achievement require that students can perform well when confronted with "unfamiliar particulars." Improving performance on the familiar but not the unfamiliar is score inflation.
EW: What are the implications of score inflation for both measuring and attenuating achievement gaps? Because schools serving disadvantaged students face more pressure to increase test scores via the mechanisms you describe, I worry that true achievement gaps may be unchanged - or even growing - while they appear to be closing based on high-stakes measures.
DK: I share your worry. I have long suspected that on average, inflation will be more severe in low-achieving schools, including those serving disadvantaged students. In most systems, including NCLB, these schools have to make the most rapid gains, but they also face unusually serious barriers to doing so. And in some cases, the size of the gains they are required to make exceeds by quite a margin what we know how to produce by legitimate means. This will increase the incentive to take shortcuts, including those that will inflate scores. This would be ironic, given that one of the primary rationales for NCLB is to improve equity. Unfortunately, while we have a lot of anecdotal evidence suggesting that this is the case, we have very few serious empirical studies of this. We do have some, such as the RAND study that showed convincingly that the "Texas miracle" in the early 1990s, supposedly including a rapid narrowing of the achievement gap, was largely an illusion. Two of my students are currently working with me on a study of this in one large district, but we are months away from releasing a reviewed paper, and it is only one district.
I have argued for years that one of the most glaring faults of our current educational accountability systems is that we do not sufficiently evaluate their effects, instead trusting - evidence to the contrary - that any increase in scores is enough to let us declare success. We should be doing more evaluation not only because it is needed for the improvement of policy, but also because we have an ethical obligation to the children upon whom we are experimenting. Nowhere is this failure more important than in the case of disadvantaged students, who most need the help of education reform.
Inflation is not the only reason why we are not getting a clear picture of changes in the achievement gap. The other is our insistence on standards-based reporting. As I explain in Measuring Up, relying so much on this form of reporting has been a serious mistake for a number of reasons. One reason is that if one wants to compare change in two groups that start out at different levels - poor and wealthy kids, African American and white kids, whatever - changes in the percents above a standard will always give you the wrong answer. This particular statistic confuses the amount of progress a group makes with the proportion of the group clustered around that particular standard, and the latter has to be different for high- and low-scoring groups. I and others have shown that this distortion is a mathematical certainty, but perhaps most telling is a paper by Bob Linn that shows that if you ask whether the achievement gap has been closing, NAEP will give you different answers - very different answers - depending on whether you use changes in scale scores, changes in percent above Basic, or changes in percent above Proficient. This is not because the relative progress has been different at different levels of performance; it is simply an artifact of using percents above standards. This is only one of many problems with standards-based reporting, but in my opinion, it is by itself sufficient reason to return to other forms of reporting.
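An illustrative sketch, not from Koretz or Linn, of why changes in percent above a standard can mislead. It assumes two hypothetical groups whose scores are normally distributed with the same spread, one standard deviation apart, and it gives both groups exactly the same gain. The means, the gain, and the two cut scores are invented for illustration and are not NAEP values.

```python
from statistics import NormalDist

# Hypothetical values in standard-deviation units (not NAEP data).
LOW_MEAN, HIGH_MEAN = -0.5, 0.5   # starting means of the two groups
GAIN = 0.2                        # identical gain for both groups
CUTS = {"lower cut": -0.5, "higher cut": 0.5}   # two invented standards


def pct_above(mean, cut, sd=1.0):
    """Percent of a normal(mean, sd) group scoring above the cut."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cut))


def gap_change(cut):
    """Change in the high-minus-low gap in percent above `cut`
    after both groups improve by GAIN."""
    gap_before = pct_above(HIGH_MEAN, cut) - pct_above(LOW_MEAN, cut)
    gap_after = pct_above(HIGH_MEAN + GAIN, cut) - pct_above(LOW_MEAN + GAIN, cut)
    return gap_after - gap_before


# The gap in mean scale scores does not change, because both groups
# gain the same amount (zero up to floating-point rounding).
scale_gap_change = ((HIGH_MEAN + GAIN) - (LOW_MEAN + GAIN)) - (HIGH_MEAN - LOW_MEAN)

changes = {name: gap_change(cut) for name, cut in CUTS.items()}

print(f"scale-score gap change: {scale_gap_change:+.2f} SD")
for name, change in changes.items():
    print(f"gap change in percent above {name}: {change:+.1f} points")
```

Under these assumptions the gap appears to narrow by about 3.6 percentage points at the lower cut and to widen by about 2.6 points at the higher cut, although the two groups made identical progress. The difference comes only from how many students in each group sit near each cut score.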