What Does the Research Say About Value-Added Models and Teacher Evaluation?
Looks like I touched a nerve.
It's been interesting to read the comments on my last two posts about the use of VAMs as a teacher evaluation tool, both here and on Twitter. They fall into two general categories: comments from teachers glad to see something said about VAMs that they can agree with, and comments from people questioning the credibility of the claims I made—especially with regard to the fact that I seem to have made an assumptive leap on the basis of one pilot study of questionable provenance. I guess I already reached the first group, so I'll focus on the second.
So about that research. Yes, it's true that I focused on one anonymous pilot study put out by a software company in my previous post. That's fair. (Though, for the record, the study was only one part of my argument.) I aim in this space to provide clear and understandable commentary on educational issues, not to bore people with citations to research studies. But there's plenty of research to point to on VAMs and their use in teacher evaluation.
What does it say? Here are a few choice words:
- Jane L. David wrote an article in Educational Leadership (published by ASCD) in 2010 that focused on two questions related to the use of VAMs in teacher evaluation: are VAMs fair, and are they more accurate than traditional evaluations? Citing researchers like Daniel Koretz, Jesse Rothstein, Dan Goldhaber, Michael Hansen, and researchers at the RAND Corporation, she concluded that multiple threats to the "trustworthiness" of value-added measures have been identified, calling their fairness into question. As for the second question (do VAMs do a better job of predicting effectiveness than traditional teacher evaluations do?), her review concluded that they do not. That would seem to make VAMs a solution in search of a problem.
- Goldhaber & Hansen, two researchers cited by David, published another study in 2010 that concluded with this: "We suspect the results presented here will tend to reinforce views on both sides of the policy divide over whether VAM estimates of teacher job performance ought to be used for high-stakes purposes like determining tenure." Let's call the findings of that study inconclusive, at least as far as the evidence is concerned. But there's an implicit point being made here unwittingly by the authors: it is that although supporters of the use of VAMs would have you believe that this is all just science—and if you don't like it then it must be because you don't respect the use of evidence—really, underneath it all, this is simply about what people value. Remember that.
- Another study, published by Koretz and three co-authors at RAND in 2003, concluded usefully that "the claims of developers of Value-Added methods notwithstanding, VAM methods as currently developed are of limited usefulness as a tool for any routine assessment purpose, but are well enough developed to be quite useful as research tools." That's another important point: applying VAMs for limited research purposes might be appropriate and even useful; using them to evaluate something as complex as teaching with significant stakes attached is not. Whatever the value of VAMs may be, they should not be used as an evaluation tool.
- The National Association of Secondary School Principals has helpfully reviewed several studies as well, including one conducted by Morgan Polikoff and Andy Porter that was just published in 2014 and came with its own press release. That press release includes this summary of their findings: "New research...finds weak to nonexistent relationships between state-administered value-added model (VAM) measures of teacher performance and the content or quality of teachers' instruction. Based on their results, the authors question whether VAM data will be useful in evaluating teacher performance and shaping classroom instruction." Go ahead; read it for yourself. Or watch this video of Polikoff describing key findings. In it he says that "value-added scores don't seem to be reflecting the quality and content of the work that teachers are doing in the classroom." Trust him if you don't trust me.
Now, I cited a handful of studies that I found in five minutes on the internet; if I headed over to the library I could probably find 50 more. Does that mean these studies are conclusive? Of course not. But they all come from credible sources, and, if anything, they make it plain that the question of whether VAMs can accurately and reliably help us identify effective teachers is very much an open one. Given that, I personally prefer to err on the side of caution. Teaching is one of those jobs that looks a whole lot easier if you're not the one doing it. For states to invest precious resources in an idea when multiple research studies have failed to confirm its viability as a policy solution makes little sense to me. And, yet, that is exactly what some states have done.
Which leads to Tennessee, and the "pilot study" I cited. To assert that the presence of a second teacher in a classroom can be isolated and accounted for is one thing; to assert that the second teacher's presence makes no real difference at all, as the findings of this pilot study apparently do, strains credulity. It makes you wonder if the model even works. This study looks, to me, like little more than an effort to put lipstick on a pig: a way to reassure nervous mentor teachers, administrators, and policymakers that the presence of inexperienced student teachers in a classroom won't muck up the VAM scores of the teachers who host them. Well, that's good to know.
So where does that leave us? I made my argument against VAMs, but what if the proponents of value-added models are right? What if research someday vindicates them and proves that VAMs can be used to accurately determine the effectiveness of individual teachers? Well, here's the thing: even if it did, VAMs would still be bad policy. Why? Because the house of cards can only stand up if we use scores on standardized tests as a proxy for teacher effectiveness and student achievement. Of course teachers have an impact on student learning. The question is whether or not that impact can be isolated and measured for evaluative purposes, and the problem with VAMs is that they rely on student scores on standardized tests as the indicator of student achievement. Like I always tell my students, the problem with multiple-choice questions is that you can always get the right answer without knowing anything at all. At the same time, the more sophisticated the tests get, the harder it's going to be to evaluate them in a standardized way, no matter how convoluted the model used to evaluate them may be.
That makes this line of thinking a genuine lose-lose proposition. If we simplify the tests, we may increase the viability of the one-size-fits-all evaluation mechanism—but we'll further narrow what gets taught and how it gets taught. On the other hand, if we make the tests more complex we'll make the models needed to generate value-added scores even more complicated too, and people have a right to know and understand exactly how they're being evaluated—especially if their jobs depend on it.
Maybe it's time to redirect. It's true that scores on a test like NAEP can predict future success in college, but as Robert Putnam has just pointed out again, they predict it nowhere near as effectively as a student's family income does. Let's work on that data point. If statisticians really want to be helpful they'll start developing models to assess the value added to each child's education by the presence of a living wage for mom and dad, three square meals a day, a warm, safe, and comfortable place to live and sleep, and a school with the resources it needs to educate kids well. If economists really want to be helpful, they'll stop giving politicians excuses to look for "efficiencies" so they can cut spending even more and will instead start flooding them with data on the shameful inequality that makes real educational progress so hard to achieve. Let's focus on taking collective responsibility for our kids, and stop asking teachers and schools to bear the entire load themselves.
Surely we can come up with a better plan than this. Surely we can come up with a teacher evaluation model that is more stable, less contentious, more respectful of teachers, and less likely to be misused by policymakers. I know educational researchers are up to the task. Are policymakers?