
# Princeton Study Takes Aim at 'Value-Added' Measure

Merit pay for teachers is an idea that seems to be growing increasingly popular these days with politicians on both sides of the aisle. But if performance-pay plans are going to succeed, they will require an evaluation system that is widely seen as fair and accurate.

A lot of policy makers are pinning their hopes on "value-added" measures of student achievement to fill that bill. The thinking is that value-added models provide a fairer measure of teacher effectiveness because they track students' year-to-year learning gains, rather than their absolute levels of achievement. That way, teachers are not getting undeserved blame for the learning deficits that students bring with them to the classroom or undue rewards for being blessed with a classroom of high achievers.

A forthcoming study, however, suggests policy makers might want to think twice before embracing value-added measures of teacher effectiveness. In a paper due to be published in February in the *Quarterly Journal of Economics*, Princeton University economist Jesse Rothstein uses sophisticated modeling techniques to suggest that such measures could be based on shaky assumptions.

Using student-testing data from North Carolina, Rothstein makes his case by developing a "falsification" test for value-added models. For example, he wondered, would the model show that 5th grade teachers have effects on their students' test scores in 3rd and 4th grades? Since it's impossible for students' future teachers to cause their previous achievement outcomes, Rothstein reasons, there should be no such effects.

But in fact there were—and they were quite large. Rothstein says this happens because students are not randomly sorted into classrooms. A principal, for example, might assign a large number of students with behavior problems to a teacher who is known to have a way with problem students, or parents of high achievers might lobby to get their child into a class with the "best" teacher. When that happens, it biases the results of value-added calculations.
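To see how nonrandom sorting alone can generate Rothstein-style "effects," here is a minimal toy simulation (my own illustration, with invented numbers — not Rothstein's model or the North Carolina data): students are block-assigned to 5th grade teachers according to their prior achievement, and each teacher's naive "value added" is then computed on gains the students earned *before* ever meeting that teacher.

```python
# Toy simulation: nonrandom sorting makes 5th-grade teachers appear to
# "affect" their students' earlier (4th-grade) gains. Hypothetical numbers.
import random

random.seed(0)

N_PER_CLASS = 25
N_TEACHERS = 4

# Each student has a persistent "ability" that drives scores in every grade.
students = [{"ability": random.gauss(0, 1)} for _ in range(N_PER_CLASS * N_TEACHERS)]
for s in students:
    # Gain earned in 4th grade, BEFORE any 5th-grade teacher is assigned.
    s["grade4_gain"] = s["ability"] + random.gauss(0, 0.5)

# Nonrandom sorting: rank students by prior gain and block-assign them to
# 5th-grade teachers (e.g., high achievers lobbied into teacher 3's room).
students.sort(key=lambda s: s["grade4_gain"])
for i, s in enumerate(students):
    s["teacher5"] = i // N_PER_CLASS

# Naive "value-added" estimate for each 5th-grade teacher, computed on the
# grade-4 gains their future students earned before meeting them.
for t in range(N_TEACHERS):
    gains = [s["grade4_gain"] for s in students if s["teacher5"] == t]
    print(f"teacher {t}: mean prior gain = {sum(gains) / len(gains):+.2f}")
```

Under random assignment, the four printed means would all hover near zero; the sorting step alone is what makes some teachers appear to "cause" their students' past achievement.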

If this study sounds familiar, it's because it's been circulating for a while and gathering lots of buzz. I'm not yet sure why the finding that a 5th grade teacher seems to *cause* students' 3rd and 4th grade achievement automatically implies that students were not randomly sorted, but I hope to figure that out. Look for a more detailed story from me soon in *Education Week.*

In the meantime, you should know about two other studies that offer some counterpoint to Rothstein's findings. Thomas J. Kane and Douglas O. Staiger, for one, conducted a small experiment in Los Angeles public schools to see if value-added calculations would match the experimental results. They did. See their paper, "Are Teacher-Level Value-Added Estimates Biased?: An Experimental Validation of Non-Experimental Estimates."

A second study, a working paper by Cory Koedel and Julian R. Betts, suggests that the kinds of biases that Rothstein highlights in his paper can be overcome with more complex value-added models.

You can find a summary of Rothstein's paper and the full text on Princeton's website, where they were posted yesterday.

You don’t understand Rothstein’s logic? It was clearly explained in the following:

“Rothstein (2009) takes an entirely different approach based on Chamberlain's correlated random effects model when testing for student fixed effects in his analysis. In fact, a test for the statistical significance of the student fixed effects in equation fails to reject the null hypothesis of joint insignificance. However, the test is of low power given the large-N, small-T panel dataset structure (typical of most value-added analyses), limiting inference.

Despite these concerns, econometric theory suggests that the inclusion of student fixed effects will be an effective way to remove within-school sorting bias in teacher effects as long as students and teachers are sorted based on time-invariant characteristics. We estimate the within-students model by first-differencing equation and instrumenting for students' lagged test.”

Seriously, what is this about: giving theorists opportunities to work out interesting math problems, or using VAM in real-world settings? If it's the latter, Koedel and Betts support Rothstein. They write:

“Our entire analysis is based on a low-stakes measure of teacher effectiveness. If high stakes were assigned to value-added measures of teacher effectiveness, sufficient safeguards would need to be put in place to ensure that the system could not be gamed ...”

I’ll just give two examples of why VAM can’t be used fairly in the real world. Koedel and Betts wrote:

"... on the other hand, our results are encouraging because they indicate that sorting bias in value-added estimation need not be as large as is implied by Rothstein's work. A key finding here is that using multiple years of classroom observations for teachers will reduce sorting bias in value-added estimates. This result raises concerns about using single-year measures of teacher value-added to evaluate teacher effectiveness. For example, one may not want to use achievement gains of the students of novice teachers who are in their first year of teaching to make decisions about which novice teachers should be retained."

Tell that to Michelle Rhee.

Real world, can you tell me how the above doesn’t disqualify VAM for evaluation of any teachers?

Real world, schools will belatedly notify teachers with statements like, “according to our multi-year data, we should have fired you three years ago. Please return your salary for the years of ...” Or more likely, the official statement should say, “We regret to inform you that a multi-year analysis shows that we should not have destroyed your career as a result of scores from FY200? ...”

Give me a break! Would people seriously be considering VAM as evaluation tools unless they had another agenda?

Or consider one minor methodological point. Koedel and Betts wrote:

“We include students who repeat the fourth grade because it is unlikely that grade repeaters would be excluded from teacher evaluations in practice.” So they used a database where, in their words, “In our original sample of 30,354 students with current and lagged test-score records, only 199 are grade repeaters.” And, “Similarly to Rothstein, the dataset used to estimate equation does not include any school switchers ...”

Real world, how many neighborhood schools have no mobility, and how many teachers have only one repeater in their class load of 150?

So, a model is not worthless if you assume conditions that would never occur in an urban school? And if “reformers” weren’t trying to force urban teachers to have high expectations and end the achievement gap they wouldn’t be wasting their time with this foolishness? Why they have such an unrealistic vision of education is another subject.

Real world, you may not understand how a future teacher affects the current teacher’s performance, but anyone should understand the effects of principals, or central office policies, or a whole range of issues that are completely beyond the influence of the teacher.

This VAM debate would be interesting if we had an ironclad guarantee that it would never be used to influence the real world fate of teachers.

What's going on with Rothstein's work?

* Well, first you need to understand the logic he is trying to test. The theory of VAM is that, given all the testing data we have on these kids, we can make calculations that isolate a teacher's impact on his/her students. That is, we can statistically control for demographics, home factors, historical factors, ability, and everything else. By controlling for everything else, we can isolate the value that each teacher adds.

There are lots of people out there who believe the theory I just summarized, and there are lots who do not. How do we test it? All that math can make something look really authoritative, but how do we know it actually works?
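For concreteness, here is one minimal sketch of the kind of calculation the theory describes (a generic gain-score model with teacher indicators; real VAM implementations are far more elaborate, and every number here is invented): regress this year's score on last year's score plus a dummy variable for each teacher, and read the dummy coefficients as estimated "value added."

```python
# A generic value-added sketch: score_this_year ~ score_last_year + teacher
# dummies. Simulated data with randomly assigned teachers; all numbers invented.
import numpy as np

rng = np.random.default_rng(1)
n, n_teachers = 300, 3

teacher = rng.integers(0, n_teachers, size=n)   # random assignment here
true_effect = np.array([-0.5, 0.0, 0.5])        # hypothetical teacher effects
prior = rng.normal(0, 1, n)                     # last year's score
score = 0.8 * prior + true_effect[teacher] + rng.normal(0, 0.5, n)

# Design matrix: lagged score plus one indicator column per teacher
# (the indicators absorb the intercept).
X = np.column_stack([prior] + [(teacher == t).astype(float) for t in range(n_teachers)])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print("lagged-score coefficient:", round(float(beta[0]), 2))
print("estimated teacher effects:", np.round(beta[1:], 2))
```

With random assignment, as simulated here, the estimated effects land close to the true ones. Rothstein's challenge, described below in this thread, is about what happens when assignment is *not* random.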

* Rothstein came up with an interesting idea. He looked at the theory and said, "OK, if the theory is valid, then this math should show an impact of teachers on their students during the year they are in that teacher's class. It might even show an impact for years in the future, too -- after all, a really good teacher has very long-term impacts. But no teacher is good enough to impact his/her students' test scores *BEFORE* they are his/her students."

Rothstein has offered a test, based on this little thought, to see if a VAM system works. That is, he is attempting to address that question without going after the potentially controversial logic. Rather, he is taking the VAM system as it is -- any VAM system. By "falsifiable" he means that it is a test that can be failed, as opposed to simply passed or "inconclusive."

* So, Rothstein took a system, but instead of using the data from this year, he used the data from last year. That is, he checked whether the system would predict that some teachers did a better job on their students before they even got them than others did. Obviously, that is a nonsensical suggestion. The VAM system should not show any impact by this method, and if it does, it must cause us to doubt the output when used on this year's data. That is, **if it can find a teacher's impact when there is absolutely no chance that there was one -- because it is logically impossible to have an impact on students before they are your students -- then there's no way that we can trust whatever impacts it shows at other times.**

Rothstein has given us all a gift. It is an easy, simple test to apply for any district with a VAM system in place. It would be simple for any district evaluating VAM systems to ask their designers to run this kind of test on their own data. The logic of a failure is very easy to explain -- one does not have to resort to statistics and math to do so.

Bravo, Mr. Rothstein!