
# How Do Tests Measure Up?

The modern educator’s life often feels as if it is driven by test results. Test scores are now used to compare students, to compare and give grades to schools, and even to compensate teachers. But have we placed too much importance on test scores? Harvard professor Daniel Koretz, who teaches educational measurement, has taken on the task of educating us all on what test scores can tell us – and what they cannot.

His book, *Measuring Up*, was published recently by Harvard University Press. I read the book, and Dr. Koretz answered my questions below.

1. What is the problem with teaching to the test? If the tests and standards are sound, what is the problem?

Many people believe that if you have a high-quality test aligned with sensible standards, teaching to the test must be fine. They are wrong. Some degree of teaching to a good test is desirable. If a test shows that your students are quite weak in dealing with proportions, you ought to bolster your teaching of proportions. That’s why we test. But even with a test aligned to solid standards, there is a very real risk of excessive or inappropriate teaching to the test. To see why, you have to go back to ground zero, to the basic principles of testing.

As I explain in more detail in Measuring Up, a test is a small sample of behavior that we use to estimate mastery of a much larger “domain” of achievement, such as mathematics. In this sense, it is very much like a political poll, in which the preferences of a very small number of carefully chosen people are used to estimate the likely voting of millions of others. In the same way, a student’s performance on a small number of test items is used to estimate her mastery of the larger domain. Under ideal circumstances, these small samples, whether of people or of test items, can work pretty well to estimate the larger quantity that we are really interested in.
However, when the pressure to raise scores is high enough, people often start focusing too much on the small sample in the test rather than on the domain it is intended to represent. What would happen if a presidential campaign devoted a lot of its resources to trying to win over the 1,000 voters who participated in a recent poll, while ignoring the 120 million other voters? They would get good poll numbers if the same people were asked again, but those results would no longer represent the electorate, and they would lose. By the same token, if you focus too much on the tested sample of mathematics, at the expense of the broader domain it represents, you get inflated scores. Scores no longer represent real achievement, and if you give students another measure—another test, or real-world tasks involving the same skills—they don’t perform as well. And remember, we don’t send kids to school so that they will score well on their particular state’s test; we send them to school to learn things that they can use in the real world, for example, in later education and in their work.
Nothing in this process requires ‘bad’ material on the test. The test can be very high quality and well aligned with solid content standards. The risk of score inflation requires only two things: that the test be a small sample of the domain, and that important aspects of the test are somewhat predictable. The first is always true of tests that measure large domains, and the second usually is.
So, how can you teach appropriately to a well-designed test? I suggest two guidelines. First, focus on the big picture, the knowledge and skills the test is intended to represent, not the details of particular test items. Second, ask yourself whether your forms of test prep will give kids knowledge and skills that they can readily apply, not just on your state’s test, but also in novel situations, including other tests as well as real-world applications. If your answer to the second question is ‘no,’ alignment will be no protection against score inflation, and you need to change how you teach to the test.

2. Why shouldn’t you use test scores to tell which schools are doing better than others?

Two reasons: scores reflect some things you don’t want to count and exclude others that you should. Many things other than school quality—for example, students’ family background—strongly influence test scores. If one school scores higher than a second, that difference may reflect educational quality, irrelevant non-educational factors, or both.
The second reason, less often discussed, is the incompleteness of achievement tests. Even a very good test measures only a modest proportion of what we value. Schooling has goals other than achievement, such as motivation to learn and a willingness to apply what one has learned in the real world. Tests don’t measure these. Most testing systems measure only some subject areas. In my state, a student must pass tests in mathematics, English, and science to get a diploma, but the state does not test history or government. Within tested subjects, such as mathematics, we test some content but not all, leaving out some important but hard-to-test material. Therefore, tests provide limited, specialized information about student performance—very valuable, but not comprehensive.
In the current context, another reason is score inflation. Inflation can be very large, and it tends to vary a great deal among schools. That means that sometimes, schools that seem to be doing very well are simply coaching more effectively. If you were to use an uncorrupted measure of learning, some high-scoring schools would not look so good.

3. What do you think about the value-added methods gaining prominence as a means of measuring the contribution of individual teachers to student achievement?

Value-added methods, which evaluate how much the achievement of individual students has grown over a period of time, are in several ways a big improvement over the alternatives. First, value added is more appropriate. It makes more sense to hold educators accountable for growth while students are on their watch than to hold them accountable for students’ average scores or the percentage ‘proficient,’ both of which reflect in large measure what students bring with them when they enter a grade. Second, value-added models do a better job—but generally not a complete job—of controlling for the non-educational factors that influence achievement.
However, value-added methods are by no means a silver bullet, and they pose some very serious difficulties of their own. For example, estimates of growth in individual classrooms in a single year are generally very imprecise, which is to say that they bounce around a good bit because of irrelevant factors. The result is that in any given year, many teachers will be misclassified. A second problem is that the statistical models employed are complex, and the field has not yet agreed which methods are best. These different methods can rate teachers differently. The rankings can be quite sensitive to decisions made about how to test a subject or even how to scale test scores. Even at their best, value-added methods tell us only how much students have grown; we can’t be confident about the share of that growth that is properly attributed to the effects of the teacher. And value-added methods do nothing whatever to address the core problems of poorly designed test-based accountability: inappropriate test prep and score inflation.
The literature on value-added modeling is highly complex and is difficult for people without a very strong statistical background to understand. In response, I recently wrote an article that explains the pros and cons of value-added approaches in plain English, which you can download here.

4. In California, we have a high school exit exam that all students, including those with special needs, must pass to gain their diploma. Last year 46% of special needs students failed this test. The State Superintendent said: “Special-education students deserve a diploma that has real value and real meaning.” What would you say?

The principle is right, but the policy is problematic. The policy came about because advocates for students with special needs argued that if we don’t hold schools accountable for their achievement—rather than just for their placement and so on—many of them will be shortchanged and will not live up to their potential. As a former special education teacher, I strongly agree with that argument. However, the simple fact is that kids differ tremendously in terms of their achievement. This is true throughout the world, even in more equitable societies, and it is true of all kids, not just those with special needs. Some students, even given ideal educations, will perform relatively poorly on achievement tests.

This leaves us with a dilemma. Let’s say a student with a substantial disability—one that would lead to a prediction of low scores—gets a good education, works very hard, and ends up scoring much better than expected, only a few points below the “proficient” standard. What is the best thing to do? Is it better to give him the same diploma as kids who passed the test? A diploma that is in some way different? Or no diploma at all? This is a policy problem, not a technical one. Personally, I would choose the second option over the third.

5. Your book suggests we should approach setting goals “that reflect realistic and practical expectations for improvement.” Do you have suggestions as to how educators and policymakers might approach this process?

We have to stop setting entirely arbitrary performance targets, and we have to stop insisting that the same targets are appropriate for all schools under all circumstances. We should aim for moderate rates of progress over the moderate term, allowing for bad years as well as good. There are a number of ways we can get information about what is realistic. For example, we can look at historical trends, at evaluations of particular programs, and at the progress shown by exemplary schools (being careful to avoid mistaking score inflation, which can be very rapid, for meaningful gains in learning).
However, I should add one more caution. Our current policies assume that test scores are sufficient and that there is no need for any human judgment in the evaluation of schools. I think that is a serious mistake, and it affects the setting of targets as well. Consider two schools that have identical, unacceptably low scores on their state test. The first school has a highly transient and severely disadvantaged student population, with many students who arrive not speaking English. The second school has none of these disadvantages: it is in a stable community, and almost all of its students are native speakers of English. Wouldn’t it make sense to expect more rapid gains from teachers in the second school?

What do you think of what Dr. Koretz has said? How should we be using information from tests? What current practices should we be challenging?

It seems as though Dr. Koretz, and many others, would argue for a "sliding scale" of achievement expectations. This comes through particularly in his response to your last question, in which he calls the current proficiency targets "arbitrary" and suggests setting more individualized goals for progress on a school-by-school basis, with heavy weight given to socio-economics, etc.

Certainly we still have people living in America today who were educated under such policies--some blatantly stated--that education at the highest levels ought to be provided only to the elite, as defined by socio-economics, gender and race. There are those who can tell of being steered into business math rather than algebra, home economics rather than mechanical drawing and general science rather than biology. In each case, someone had pre-determined what was "best" or most appropriate based on the kinds of qualities that Dr. Koretz suggests: economic disadvantage, transience, first language, as well as assumptions about the life course of the student.

I suspect that while many teachers will cheer Dr. Koretz's suggestion, very few would be comfortable sending their own children to a school that is only expected to provide an education "good enough" for a low income population.

Margo,
We are seeing the results of the arbitrary standards set by No Child Left Behind in California, where almost all schools will soon be designated as failing. Do you disagree that these standards are arbitrary? Or do you believe we must judge all schools based on one measurement: performance on the state test?

Thanks for posting a good interview, Mr. Cody. I call "external validity checks" what you, Koretz and MargoMom name "arbitrary."

In answer to your last question of MargoMom, there's also a different intellectually consistent position from the one Koretz describes: Yes, legislatures have a fiduciary duty to confirm that public funds buy what was authorized: increased student academic achievement that meets approved standards. Academic performance standards serve as a way to judge schools against one or more external measurements of academic performance. To avoid the appearance of conflict of interest, educators should not be involved beyond contributing technical assistance during the development and conduct of those assessments.

To continue this alternative point, teachers have known, over a century of use of standardized assessments, how to handle such tests successfully with most students, including students coming from the most disenfranchised families, however that's defined. During more recent decades, teachers also have learned how to help most students with special needs meet them as well.

Continuing the point further, the interview missed a crucial question: What should legislatures do to increase confidence that teachers fulfill an implied obligation to assist all students to perform better on external academic performance checks, even when teachers would not choose to do so for their own reasons?

While I understand both of these views, I see value and deficits in each for optimizing the academic performance of each student in a public school. When in doubt, I choose, "Let's make what's expected work and then discuss alternatives."

Dear Bob,
I did not become a teacher to "sell" the legislature or any other client increased student academic achievement. I think one of Dr. Koretz's most important points is this: "Even a very good test measures only a modest proportion of what we value. Schooling has goals other than achievement, such as motivation to learn and a willingness to apply what one has learned in the real world." I entered teaching to improve the lives of my students, to increase their future opportunities, to open their eyes to their own abilities and to the wonders of the natural world.

I am capable of teaching them certain content, aligned with standards set by the legislature or state board of education, and I do not have a problem with external tests that check to see how well the students and I are doing in this regard. What I find arbitrary are the absolute judgments that are made based on this single dimension, when there are many other dimensions I believe are important.

I do not see the legislature as my client. I see my students and their parents, and the community in which we live as my clients. I believe that they share my sense that the reliance on standardized test scores has gone overboard, and that there are other things that are also important to consider.

I think teachers have a challenge in describing to the public at large these other dimensions of learning, because in the absence of this awareness the criticism of test scores might appear to be an abdication of responsibility for results. But teachers that I work with are powerfully motivated to serve their students' interests. We simply object to our students being judged on the basis of one set of tests a year, when so much more is going on that should be valued.

Accountability for education in schools is one thing and accountability for society's ills is another. NCLB and state standardized testing assume that all students learn the same and therefore should test the same. As you should know, this is not true. Many theories in education indicate that student learning is not "one for all". After all, Maslow's Hierarchy of Needs and Howard Gardner's Multiple Intelligences indicate that students' needs are not being met when we consistently try to get students to learn "one way" and in a style not meant for some. There are multiple learning styles, multiple teaching styles, and multiple means of assessing knowledge. All of this accountability issue stems from legislators that base their assumptions on "A Nation at Risk". From this and other sources (dissertations, articles), they have come to the conclusion that the United States is behind the rest of the world in math and science. This may be true, if you base it on data from the other countries. But how accurate is this data? After all, it is so obvious that those Chinese gymnasts were underage. This was state-sponsored cheating in my opinion. To me this shows just what other countries will do to show the world just how "good" they are. Just look around the world. Arguably, we have the best technology and space program in the world. The Chinese? You mean to tell me that they are better in math than the USA? How long ago did we go to the moon? In my opinion, we really need to deemphasize standardized testing and concentrate more on educating our youth and giving us the resources to work at diminishing gangs, violence, drugs, unemployment, and drop outs.

Anthony:

And have you stopped beating your wife?

No, I don't believe in judging schools "on a single test." But they are not in fact being judged "on a single test." Even if we confine our discussion to the testing indicators on school report cards, there are multiple tests, at multiple grade levels, and generally reported over time. In addition, there are multiple ways of reporting these results (aggregated, disaggregated, performance indicators, safe harbor calculations, etc). States are required to select one non-academic indicator--which for most is attendance (and graduation at the high school level).

I also value motivation to learn and preparation to apply learning "in the real world." These things also have indicators, or proxies--although not every data point is collected at the school level. Certainly graduation/drop-out rates are indicators of the motivation to learn (as well as attendance and discipline), as would be the rates of continuation to college or other training, and employment following graduation. Some high school programs (High Schools that Work, for example) follow these things more than others. I would suggest to you, however, that schools that are doing well on these indicators are also having fewer problems with the academic indicators.

But, I put it to you. Can you honestly say that removing the academic indicators would benefit the kids on the lower rungs of the socio-economic scale? Would they emerge from education with more, or fewer, of the opportunities available to their more advantaged peers?

Margo,
You ask me: "Can you honestly say that removing the academic indicators would benefit the kids on the lower rungs of the socio-economic scale?"

To which I reply: please read what I wrote. Neither Dr. Koretz nor I have argued that tests are useless and should be done away with. In my last comment I wrote:

"I am capable of teaching them certain content, aligned with standards set by the legislature or state board of education, and I do not have a problem with external tests that check to see how well the students and I are doing in this regard."

I am suggesting that using one set of tests (not a "single test") as the primary indicator of the effectiveness of a school is misguided. I believe Dr. Koretz has provided numerous reasons why this is so. I do not think the inclusion of high school graduation rates is an adequate expansion into the other dimensions of learning that I think are important.

There is evidence that the current practice of restructuring schools that have missed their growth targets for several years in a row is not working. And as we head towards a time when the vast majority of schools will be missing their targets due to the absurd growth expectations in NCLB, it is time to revisit the entire system and find ways to restructure schools that really work. Part of that means a redefinition of the ways that we measure success, because the narrow focus on test scores is missing so much that is important.

Anthony,

It is very obvious that you don't teach in East LA or anywhere, for that matter, where at-risk students are the predominant population. When the time rolls around when everyone is supposed to be at 100%, what will the nation do then? Take over the schools? Education has always been the responsibility of the state, not the federal government. NCLB is largely an unsuccessful, unfunded federal mandate that is going to create an educational system full of cheaters and people willing to do whatever it takes to keep their jobs.

Peter,
I do indeed work in a community not too different from the one you describe. My current role is as a coach working with teachers.

If you read my comments and posts a bit more carefully, you will see we agree much more than you think. I am hardly a defender of NCLB. If you are interested in my own experiences with the effects of NCLB, you can read something I wrote last spring describing them here: http://www.edutopia.org/standardized-testing-NCLB-school-suffers

Please, receive my sincerest apology. You do indeed realize what is really going on with NCLB. I, too, would like to say that I believe that standardized testing can be useful if conducted in the right context. USDE needs to realize that the very schools (like yours and mine) are really the ones that are assisting the most at-risk students and being successful (not to their standards--but successful). If these schools are shut down, where will the educational system go from there? Most importantly, where will the students go...prison?

In my opinion, schools that show successes working with these students should be congratulated and given assistance to be more successful. But in order to be successful with these students it takes time (not one year), effort, additional staffing, and additional funding coupled with an accountability system that will recognize that fact and support the efforts of thousands of teachers and administrators working to improve the educational process. Again, I apologize for my misunderstanding of your position and wish you continued success.

re: Question 1. What is the problem with teaching to the test? If the tests and standards are sound, what is the problem?

Who agrees that the tests and standards are sound? I have a philosophical issue with 60 questions on US history, for example, where nearly every one requires a student to have memorized a fact from a rushed curriculum. (We are supposed to teach the entire year's history standards by April 29th this year in my district, our scheduled social science CST date.) And come on, how many questions in history have only one answer?

I have taught every level of English and social science, from LEP and SDAIE to AP and IB. The Spec Ed dept. at my school puts their kids into my classes because they know they are welcome there and will receive an education geared to their learning styles and abilities. This is my 18th year teaching and I'm through!

Teaching to the test is ok if it's a great test. The CSTs are not. Apparently, we do not want to pay for the 'grading' of tests that might actually reflect what our students know and can do. There was such an experiment a few years back called CLAS. I worked for a chunk of the summer as a reader of the 8th grade CLAS English tests. They were wonderful! But someone had to pay me, and multiple choice scantrons are much cheaper.

My favorite tests to teach to are the IB History tests. All ESSAY and full of student choice about which questions to write on. They ask the kids to show their strengths in knowledge, analysis, even synthesis. I'm sure they cost the International Baccalaureate a mint to assess (which, of course, they pass along to member schools), but what a difference there is in my day between my IB history classes and my regular college prep US history classes.

And the part I can no longer stomach? That my less academically prepared kids are getting the crappy education! The one based on memorization and speeding through curriculum so we can "cover" as much as possible of what might be on the test. It's awful. They deserve so much better than this. I can't do it to them any more. (Or to me!)

I see former students all the time who say things like "Remember that cool Supreme Court thing we did to argue one of the cases back in government class? I never forgot that. It helped me understand how it's supposed to work." TODAY, I'd never take a week to prep a class for such a thing. Are you kidding? Then we'd be even further behind on our memorizing!

I know which is the right way and which isn't. This isn't. As my district rolls out the standardized benchmark assessments in all subject areas this year, I look at my teacher friends' schools in Marin County, which have APIs in the low 900s and who have no idea what I'm talking about when I complain about what NCLB has done to my students and my career.

I understand the need to know how we're doing. I just don't believe this is the way to find out.

The root of the problem is that national standards and national accountability just don't work. NCLB has done a great job of proving that point. Get rid of the DOE! We need local control and local accountability. The parents and teachers know the most about what their kids need. Our arrogant government needs to take its hands off and give power back to the people. It's the American way.

Well said, Tim!

Lisa:

As the parent of a "less-prepared" student--by virtue of years in "special" education--I can attest that my child was getting a "crappy education" before he was required to be included in the tests and scores reported for his group. The difference is that no one knew (or admitted it) before. Yes--the tests require some content knowledge. Prior to NCLB's criterion-referenced testing, the assumption was that teaching math and reading was adequate (for students with disabilities), because any science or social studies testing that would be required only tested reading ability. To a great extent, this was true. Take a look at GED testing in those subjects, for example--all based on reading a passage and answering questions based on content contained within the passage.

Tim--I would tend to agree that parents and teachers are primary "stakeholders" in determining what kids need. NCLB actually specifies several roles for parents. One is the parent as consumer, able to choose another school if the assigned school is in trouble, and to choose among tutoring providers for assistance. Districts/teachers have not been terribly supportive of these roles--and generally serve as gatekeepers.

The second role is parent as reform agent. There are specified reports to parents (school/district report cards, School Improvement reports), which are intended to provide parents with background that they need to be effective in holding schools accountable for improvements. Schools receiving Title I funding are then also required (if in school improvement) to involve parents in school improvement planning. This generally gets only cursory attention by schools--and may be met by getting a single hand-picked parent to sign off as having participated in planning (in my district it is paid "parent consultants" who fulfill this role).

So, an important question to ask would be, why are buildings/districts/teachers (who are generally the ones writing the "improvement" plans) so resistant to the idea of parent involvement in school improvement?

Hi Margo/Mom,

I don't think we're in disagreement that assessments are important. My bone of contention is the test itself. This is not the right one.

re: Parents. I haven't seen resistance at my current site to parent involvement. We have many parents of different backgrounds at every meeting. In-services, staff meetings, WASC work,...maybe we're a fluke?

What I DO see is that input from all stakeholders seems to be reactive instead of proactive. In my district, we are told what we need to do by the district, which has met with consultants for advice--consultants who are never local teachers OR parents. Then we're told to gather the stakeholders to figure out how to do it. Parents are there, but in this scenario, their input is as useless as that of the teachers. It comes too late in the process!

This might be a district issue in our case, but not including the parents and teachers in the early parts of the discussion and throughout the process is a huge mistake.

I think you and Tim are right.
