Interpreting the Preliminary MET Findings
A prominent scholar has criticized the preliminary report the Bill & Melinda Gates Foundation recently put out on its Measures of Effective Teaching study, arguing that the conclusions the Gates Foundation draws from its data aren't supported by its research.
You'll remember that the earlier paper found that value-added estimates of teachers, particularly in math, did appear to predict performance in a teacher's other classes and on other tests. Student surveys also seemed to offer up some clues on teacher effectiveness.
But Jesse Rothstein, a University of California, Berkeley researcher who's written a number of papers on value-added measures, argues that the Gates report actually appears to undermine the potential usefulness of value-added as part of teacher evaluations, the opposite of the conclusions drawn in the preliminary MET report. At the very least, the report's enthusiasm for value-added is "premature," he said in an interview.
For example, the Gates project found a .54 or "moderate-to-large" correlation between teachers' value-added performance on the state math test and on a supplemental, more cognitively demanding exam. By Rothstein's reading of the data, a .54 correlation means that a teacher whose value-added put her at the 80th percentile would still have a 30 percent chance of scoring at the lower end of the scale on the other test.
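For readers who want to see where a figure like that can come from, here is a rough back-of-the-envelope sketch. It is not Rothstein's own calculation. It assumes that a teacher's two value-added scores follow a standard bivariate normal distribution, and it reads "the lower end of the scale" as "below the median" on the second test; both are assumptions made for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chance_below_median(rho, percentile):
    """Chance of landing below the median on test B, given a percentile
    rank on test A, ASSUMING the two scores are bivariate normal with
    correlation rho (an illustrative assumption, not the report's model)."""
    std_normal = NormalDist()            # mean 0, standard deviation 1
    z = std_normal.inv_cdf(percentile)   # z-score of the rank on test A
    cond_mean = rho * z                  # expected z-score on test B
    cond_sd = sqrt(1 - rho ** 2)         # spread remaining on test B
    # Probability that the test B z-score falls below 0 (the median).
    return std_normal.cdf((0 - cond_mean) / cond_sd)


p = chance_below_median(rho=0.54, percentile=0.80)
print(f"{p:.2f}")  # prints 0.29
```

Under those assumptions the result is about 29 percent, close to the 30 percent figure cited above. Passing a higher correlation to the same function produces a smaller chance.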
The two parties aren't debating the actual data in question. Instead, what this amounts to is a problem of interpretation: Where the Gates folks find promise in the notion that there's some predictive power to these value-added measurements, that power, in Rothstein's view, is "shockingly weak." It hints that teachers' estimates on value-added measures depend on things like how closely they follow the curriculum on the test, he said.
When I asked Rothstein to describe what a less-worrisome, strong correlation might look like, he suggested a 0.7 or 0.8 correlation.
In the paper, Rothstein also argues that the Gates experts didn't account for the "fade out" of value-added estimates over time, as he's documented elsewhere, and that other problems, such as the nonrandom assignment of students to classes, mean that it's hard to know how these calculations would play out in the real world.
He contends that, by stating that value-added seems to be a relatively strong predictor of performance—before study on all the other measures is complete—the foundation's interpretation is skewed toward support of value-added.
I checked in with the Gates Foundation after reviewing the critique. Here's what spokesman Chris Williams had to say:
"I think we would say that Rothstein is right that we need more randomized evidence of the causal effects of the estimated teacher impacts," he said. In fact, that's actually a part of the project set to begin this year.
But, he added, the foundation doesn't think that value-added should be the sole measure of teacher effectiveness.
"The paper highlights the mistakes if we were to rely solely on test-score measures, but we're not advocating that approach, and I think we've been pretty clear about the purposes of the project," Williams said. "We don't think value-added [alone] should be teacher evaluation."
Whether you view these results as a boon or a blow to value-added, the bottom line for the moment is that researchers and practitioners really don't (yet) know the optimal mix of teacher evaluation measures or how to weight them.
Other questions to consider: What happens if some of the other measures being studied, like the much-awaited teacher observations, don't turn out to be all that highly correlated with student achievement? It's a distinct possibility. After all, as it turns out, the .54 correlation between tests was one of the stronger correlations in this preliminary report. (Student ratings, though consistent across a teacher's classrooms, were generally less predictive of his or her value-added performance.)
When all of the data is out from MET, will there be enough evidence from all these measures combined to make a decent evaluation tool? Could the problems with any one of these measures be balanced out by the strength of the other measures (e.g., if a teacher gets dinged on value-added but gets a strong observation score)? It's hard to say at this point, but the preliminary MET report did find that the correlations got a little stronger when both value-added and the student perception information were combined.
In any case, whatever system is used to judge teachers will have some degree of error. Policymakers, teachers, unions, and school boards will need to come to an accord about how much and what type is acceptable for which purposes.
As for the issues Rothstein raises in this paper: If the parties agree to pick value-added as one measure (and that's a big "if"), then they'll also have to make a decision about which value-added calculation to use, which test is most appropriate for that purpose, and which curriculum will underpin that test.
Not an easy thing to do.