Sorry Mr. Press Secretary, Multiple Measures Are Not Fairy Dust

This week I engaged in another online debate with one of Arne Duncan's press secretaries, Justin Hamilton, who, readers may recall, asked me to "correct" my commentary a year ago after President Obama inadvertently criticized our over-reliance on standardized tests.

This time Mr. Hamilton took issue with a question I posed in advance of Duncan's latest Twitter Town Hall. I asked, "How can you say that we should not teach to test while NCLB waivers tie teacher & principal evaluations to test scores?"

To this, Hamilton (@EDpressSec) replied: "False. Waiver states using multiple measure not testing only."

Obviously, Mr. Hamilton is engaging in some sophistry here, because I never said that test scores were the ONLY measure being used. In any case, as a result of this pressure from the Department of Education, many states are incorporating test scores into their evaluations, digested and delivered in the form of "Value Added Model" (VAM) ratings. This week New York became the latest state to jump on this bandwagon, requiring that 40% of an evaluation be derived from test scores. Furthermore, a release from the State Education Department states: "Teachers rated ineffective on student performance based on objective assessments must be rated ineffective overall." Thus, as Diane Ravitch tweeted last night: "Teacher in NY agreement rated 'ineffective' on 40% (test scores) will be rated ineffective, period. So 40%=100%."
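To make Ravitch's arithmetic concrete, here is a minimal sketch in Python of how an override rule like this behaves. Only the 40% weight and the "ineffective on tests means ineffective overall" rule come from the agreement as described above. The 0-to-100 scale, the cutoff of 65, and the function and variable names are my own assumptions for illustration, not New York's actual scoring bands.

```python
# Illustrative sketch only. The 40% weight and the override rule come from
# the agreement as described in this post; the 0-100 scale and the cutoff
# of 65 are hypothetical values chosen to show the effect.
TEST_WEIGHT = 0.40
HYPOTHETICAL_CUTOFF = 65.0


def overall_rating(test_score, other_score, cutoff=HYPOTHETICAL_CUTOFF):
    """Return (composite, label) for one teacher on a 0-100 scale."""
    composite = TEST_WEIGHT * test_score + (1 - TEST_WEIGHT) * other_score
    # Override rule: an "ineffective" test component decides the whole
    # rating, no matter what the other 60% says.
    if test_score < cutoff:
        return composite, "ineffective"
    label = "ineffective" if composite < cutoff else "not ineffective"
    return composite, label


# A teacher with perfect marks on everything except the test component.
composite, label = overall_rating(test_score=50, other_score=100)
print(round(composite), label)
```

Under these assumed numbers, the weighted composite sits well above the cutoff, yet the override still produces an "ineffective" label. That is the sense in which 40% can act like 100%.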

Beyond this, what are the problems with using test scores or VAM ratings as one of a number of indicators of teacher performance?

First, Value Added Model ratings do not become reliable simply because they are combined with other measures. We have solid evidence of the problems associated with VAM, and those problems exist with the use of raw scores as well.

A study that looked at teacher ratings in five districts in Florida found that 27% of teachers who received "A" ratings one year got Ds or Fs the next year, and 45% of them got a C or lower. Furthermore, 30% of the teachers who got an "F" in one year got an A or B the next year, and 51% of them got a C or better.

Would you want to gamble 40% of your job rating on such a volatile indicator?

Second, VAM ratings are greatly affected by the students who are assigned to your class. Research has shown that English Language Learners and Special Ed students in particular tend to lower the ratings of teachers to whom they are assigned.

Take a look at this example from Houston:

This teacher began teaching in HISD in 2006. Until 2010-11, she was rated as "exceeded expectations" or "proficient" across every domain in terms of her supervisor evaluations. Like most teachers, she had positive (3 out of 6) and negative (3 out of 6) value-added scores across the years.

Can you look at her results and guess what happened in her third year?
She was assigned to teach a large number of English Language Learners who were transitioned into her classroom, and as you can see, her value-added scores plummeted. This well-regarded teacher left the Houston school district, while those who have stayed are increasingly confused and demoralized by this system. For more on this, and the previous study referenced above, please see the report of a policy briefing sponsored by the American Educational Research Association and National Academy of Education, Getting Teacher Evaluation Right: A Brief for Policymakers, by Linda Darling-Hammond, Audrey Amrein-Beardsley, Edward Haertel, and Jesse Rothstein.

In the Oakland schools, where I worked for 24 years, many special ed students are mixed into the general population, and many who probably should be designated special ed are not, because their parents do not wish them to be stigmatized. What is likely to be the unintended consequence here? Who will want to run the risk of teaching these students, when doing so could cost a teacher their job or reputation?

Some have suggested that other indicators among the "multiple measures" also yield good test scores, and that teachers might therefore attend to THOSE instead of teaching to the test. I am highly skeptical about this for several reasons. First of all, thanks to the NCLB waivers, the administrators who will be doing the evaluating are under the same intense pressure to produce high test scores. Thus they are likely to be looking for secondary indicators that teachers are, in fact, preparing their students for tests. This idea was borne out by Melinda Gates' discussion of multiple measures last fall at Education Nation, when she made it clear that the way that Gates Foundation researchers checked the validity of student surveys was by choosing questions that correlated with higher test scores. (See a more thorough discussion of this circular definition of good teaching here.)

If we have decided that all that matters is student test scores, the fact that we have multiple ways of measuring teacher behaviors that result in good test scores only enhances the degree to which we have made test results central to our definition of good teaching. This focus has permeated the environment so thoroughly that we find it difficult to escape. The phrase "student outcomes," which sounds as if it might actually be a robust description of learning, is almost always boiled down to test scores of one sort or another.

Mr. Hamilton, and many lawmakers, seem to be under the impression that these various multiple measures are like fairy dust. When you sprinkle them on the test scores, they magically reverse all the bad effects we have witnessed from the past decade of NCLB-driven test score obsession.

Let's end by taking a look at the raw logic at work here. The argument seems to be that since we are only using test scores for PART of a teacher's evaluation, teachers will not feel much pressure to teach to the tests. If you think about it, this seems rather silly. If you have ever watched the Miss America competition, you know that there are a number of ways contestants earn points to advance. The criteria are actually as follows:
Lifestyle and Fitness in Swimsuit - 15%
Evening Wear - 20%
Talent - 35%
Private Interview - 25%
On-Stage Question - 5%

You can see from this list that the Miss America contest also uses "multiple measures." Fully 65% of a contestant's score is derived from talent and interviews, and only 35% depends on appearance! Does this mean looks don't matter? Hardly! Just as a certain sort of physical beauty permeates this contest, test scores permeate the measures that are now being used for teacher evaluations.

The teachers of the United States have been entered in a very ugly sort of contest, where there will be few winners and many losers. The biggest losers will be our students, who will find that contrary to the bland reassurances of our highest officials, basing 40% of a teacher's evaluation on test scores will indeed promote teaching to the test. It will indeed make teachers reluctant to work with English language learners and Special Ed students. And it will drive good teachers out of the profession, exacerbating the already high turnover rate in the schools that are in the greatest need of stability.

Einstein once observed, "The significant problems we face can not be solved at the same level of thinking we were at when we created them." It is apparent that the same sort of mechanistic thinking that doomed our students to a decade of test-driven reform under NCLB is still at work here. And not even press secretary Justin Hamilton's fairy dusting of magical multiple measures can make this ugly reality disappear.

Update: Someone tweeted that "an article this critical of multiple measures should offer a solution." And Justin Hamilton responded to me "if you're against multiple measures, does that mean your for only using 1 measure? #NCLB tried that and failed." So to clarify, "multiple measures" is a mushy term that sounds inherently good, but is being used to mask the fact that the federal government is mandating the introduction of test scores into teacher evaluations. I am, of course, in favor of looking at multiple indicators as a part of a robust and holistic approach to teacher evaluation. Here is a post, with a link to a full report I helped write a couple of years ago: A Quality Teacher in Every Classroom. The key recommendations are:
1. Teacher evaluation should be based on professional standards.
2. Teacher evaluation should include performance assessments to guide a path of professional learning throughout a teacher's career.
3. The design of a new evaluation system should build on successful, innovative practices in current use.
4. Evaluations should consider teacher practice and performance, as well as an array of student outcomes for teams of teachers as well as individual teachers.
5. Evaluation should be frequent and conducted by expert evaluators, including teachers who have demonstrated expertise in working with their peers.
6. Evaluation leading to permanent status ("tenure") must be more intensive and must include more extensive evidence of quality teaching.
7. Evaluation should be accompanied by useful feedback, connected to professional development opportunities, and reviewed by evaluation teams or an oversight body to ensure fairness, consistency, and reliability.

And to be clear, I am against the use of unreliable and volatile VAM ratings, or the use of raw test score data, as these attach very high stakes to test scores, and inevitably will drive teachers to teach to the test, in spite of Secretary Duncan's rhetoric against this practice.

Update 2:
If you still think the use of test scores on evaluations is a good idea, take a look at principal Carol Burris' description of how teachers will find themselves evaluated in New York.

Update 3:
Since he accused me of a falsehood last week, I have been attempting to get Justin Hamilton to answer a simple question: "T or F: NCLB waivers require inclusion of test scores in teacher/principal evaluations." Thus far, no answer. Please tweet this to @EDpressSec if you think it deserves an answer.

What do you think? Are teachers likely to feel increased pressure to teach to the test when student scores are included as one of the means by which they are evaluated? Or will the use of multiple measures take care of the problem?

