How Do Tests Measure Up?
The modern educator’s life often feels as if it is driven by test results. Test scores are now used to compare students, to compare and give grades to schools, and even to compensate teachers. But have we staked too much importance on test scores? Harvard professor Daniel Koretz, who teaches educational measurement, has taken on the task of educating us all on what test scores can tell us – and what they cannot.
His book, Measuring Up, was published recently by Harvard University Press. I read the book, and Dr. Koretz answered my questions below.
1. What is the problem with teaching to the test? If the tests and standards are sound, what is the problem?
Many people believe that if you have a high-quality test aligned with sensible standards, teaching to the test must be fine. They are wrong. Some degree of teaching to a good test is desirable. If a test shows that your students are quite weak in dealing with proportions, you ought to bolster your teaching of proportions. That’s why we test. But even with a test aligned to solid standards, there is a very real risk of excessive or inappropriate teaching to the test. To see why, you have to go back to ground zero, to the basic principles of testing.
As I explain in more detail in Measuring Up, a test is a small sample of behavior that we use to estimate mastery of a much larger “domain” of achievement, such as mathematics. In this sense, it is very much like a political poll, in which the preferences of a very small number of carefully chosen people are used to estimate the likely voting of millions of others. In the same way, a student’s performance on a small number of test items is used to estimate her mastery of the larger domain. Under ideal circumstances, these small samples, whether of people or of test items, can work pretty well to estimate the larger quantity that we are really interested in.
However, when the pressure to raise scores is high enough, people often start focusing too much on the small sample in the test rather than on the domain it is intended to represent. What would happen if a presidential campaign devoted a lot of its resources to trying to win over the 1,000 voters who participated in a recent poll, while ignoring the 120 million other voters? They would get good poll numbers if the same people were asked again, but those results would no longer represent the electorate, and they would lose. By the same token, if you focus too much on the tested sample of mathematics, at the expense of the broader domain it represents, you get inflated scores. Scores no longer represent real achievement, and if you give students another measure (another test, or real-world tasks involving the same skills), they don’t perform as well. And remember, we don’t send kids to school so that they will score well on their particular state’s test; we send them to school to learn things that they can use in the real world, for example, in later education and in their work.
Nothing in this process requires ‘bad’ material on the test. The test can be very high quality and well aligned with solid content standards. The risk of score inflation requires only two things: that the test be a small sample of the domain, and that important aspects of the test are somewhat predictable. The first is always true of tests that measure large domains, and the second usually is.
So, how can you teach appropriately to a well-designed test? I suggest two guidelines. First, focus on the big picture, the knowledge and skills the test is intended to represent, not the details of particular test items. Second, ask yourself whether your forms of test prep will give kids knowledge and skills that they can readily apply, not just on your state’s test, but also in novel situations, including other tests as well as real-world applications. If your answer to the second question is ‘no,’ alignment will be no protection against score inflation, and you need to change how you teach to the test.
2. Why shouldn’t you use test scores to tell which schools are doing better than others?
Two reasons: scores reflect some things you don’t want to count and exclude others that you should. Many things other than school quality, such as students’ family background, strongly influence test scores. If one school scores higher than a second, that difference may reflect educational quality, irrelevant non-educational factors, or both.
The second reason, less often discussed, is the incompleteness of achievement tests. Even a very good test measures only a modest proportion of what we value. Schooling has goals other than achievement, such as motivation to learn and a willingness to apply what one has learned in the real world. Tests don’t measure these. Most testing systems measure only some subject areas. In my state, a student must pass tests in mathematics, English, and science to get a diploma, but the state does not test history or government. Within tested subjects, such as mathematics, we test some content but not all, leaving out some important but hard-to-test material. Therefore, tests provide limited, specialized information about student performance: very valuable, but not comprehensive.
In the current context, another reason is score inflation. Inflation can be very large, and it tends to vary a great deal among schools. That means that sometimes, schools that seem to be doing very well are simply coaching more effectively. If you were to use an uncorrupted measure of learning, some high-scoring schools would not look so good.
3. What do you think about the value-added methods gaining prominence as a means of measuring the contribution of individual teachers to student achievement?
Value-added methods, which evaluate how much the achievement of individual students has grown over a period of time, are in several ways a big improvement over the alternatives. First, value added is more appropriate. It makes more sense to hold educators accountable for growth while students are on their watch than to hold them accountable for students’ average scores or the percentage ‘proficient,’ both of which reflect in large measure what students bring with them when they enter a grade. Second, value-added models do a better job, though generally not a complete job, of controlling for the non-educational factors that influence achievement.
However, value-added methods are by no means a silver bullet, and they pose some very serious difficulties of their own. For example, estimates of growth in individual classrooms in a single year are generally very imprecise, which is to say that they bounce around a good bit because of irrelevant factors. The result is that in any given year, many teachers will be misclassified. A second problem is that the statistical models employed are complex, and the field has not yet agreed which methods are best. These different methods can rate teachers differently. The rankings can be quite sensitive to decisions made about how to test a subject or even how to scale test scores. Even at their best, value-added methods tell us only how much students have grown; we can’t be confident about the share of that growth that is properly attributed to the effects of the teacher. And value-added methods do nothing whatever to address the core problems of poorly designed test-based accountability: inappropriate test prep and score inflation.
The literature on value-added modeling is highly complex and is difficult for people without a very strong statistical background to understand. In response, I recently wrote an article that explains the pros and cons of value-added approaches in plain English, which you can download here.
4. In California, we have a high school exit exam that all students, including those with special needs, must pass to gain their diploma. Last year 46% of special needs students failed this test. The State Superintendent said: “Special-education students deserve a diploma that has real value and real meaning.” What would you say?
The principle is right, but the policy is problematic. The policy came about because advocates for students with special needs argued that if we don’t hold schools accountable for their achievement—rather than just for their placement and so on—many of them will be shortchanged and will not live up to their potential. As a former special education teacher, I strongly agree with that argument. However, the simple fact is that kids differ tremendously in terms of their achievement. This is true throughout the world, even in more equitable societies, and it is true of all kids, not just those with special needs. Some students, even given ideal educations, will perform relatively poorly on achievement tests.
This leaves us with a dilemma. Let’s say a student with a substantial disability (one that would lead to a prediction of low scores) gets a good education, works very hard, and ends up scoring much better than expected, only a few points below the “proficient” standard. What is the best thing to do? Is it better to give him the same diploma as kids who passed the test? A diploma that is in some way different? Or no diploma at all? This is a policy problem, not a technical one. Personally, I would choose the second option over the third.
5. Your book suggests we should approach setting goals “that reflect realistic and practical expectations for improvement.” Do you have suggestions as to how educators and policymakers might approach this process?
We have to stop setting entirely arbitrary performance targets, and we have to stop insisting that the same targets are appropriate for all schools under all circumstances. We should aim for moderate rates of progress over the moderate term, allowing for bad years as well as good. There are a number of ways we can get information about what is realistic. For example, we can look at historical trends, at evaluations of particular programs, and at the progress shown by exemplary schools (being careful to avoid mistaking score inflation, which can be very rapid, for meaningful gains in learning).
However, I should add one more caution. Our current policies assume that test scores are sufficient and that there is no need for any human judgment in the evaluation of schools. I think that is a serious mistake, and it affects the setting of targets as well. Consider two schools that have identical, unacceptably low scores on their state test. The first school has a highly transient and severely disadvantaged student population, with many students who arrive not speaking English. The second school has none of these disadvantages: it is in a stable community, and almost all of its students are native speakers of English. Wouldn’t it make sense to expect more rapid gains from teachers in the second school?
What do you think of what Dr. Koretz has said? How should we be using information from tests? What current practices should we be challenging?