Responding to the Gates Foundation: How do we Consider Evidence of Learning in Teacher Evaluations?

By Anthony Cody — August 08, 2012 15 min read

Follow me on Twitter at @AnthonyCody

This post is the second round in a five-part exchange with the Gates Foundation. This is a response to yesterday’s post from Vicki Phillips, How do we Consider Evidence of Student Learning in Teacher Evaluations?

This post can also be viewed and commented on over at the Gates Foundation’s Impatient Optimists blog.

Vicki Phillips opens her post with a complaint:

Education debates are often characterized wrongly as two warring camps: blame teachers for everything that's not working in our schools or defend all teachers at all costs.

This handwringing is hard to take seriously, because, as I wrote about two years ago, there has indeed been a war on the teaching profession, and the Gates Foundation continues to arm one side very heavily.

The Gates Foundation continues to fund Teach For America, Stand For Children, The Media Bullpen, the National Council for Teacher Quality, Teach Plus, The New Teacher Project, and literally scores of other groups which carry on campaigns to undermine due process for teachers, and actively lobby for coercive legislation that forces public schools to use faulty test scores for the purposes of teacher evaluation, against the best judgment of administrators and academic experts.

The Gates Foundation gave $2 million to promote Waiting For Superman, a movie rife with falsehoods about public education, which greatly promoted the hostile climate in which we find ourselves.

Ms. Phillips’ post focuses almost exclusively on the work of the Measures of Effective Teaching Project, an initiative of the Gates Foundation. While the Gates Foundation has invested upwards of $300 million in this project, they have spent several billion over the past few years funding other groups who are active partisans in the war on the teaching profession. We have not yet seen enough of the systems under development by the MET project to really understand them, so I will focus my attention on the other fruits borne by Gates Foundation investments.

The first question that arises when discussing teacher effectiveness is how we measure student learning. While Ms. Phillips distances herself from the use of test scores, this has been central to the reforms advanced by the Gates Foundation thus far. It is possible that the MET project will chart new ground, but before it does so, it will need to reverse all the policies and laws mandating evaluation systems that rely on test scores that have been passed at the insistence of the Gates Foundation and programs it has funded.

Researcher Walter Stroup has given the testing paradigm a much-needed shaking up in his report on the way standardized tests have been constructed, as reported in the New York Times.

He focused on classes of students that had made significant strides in their understanding of math concepts. When he reviewed their standardized test scores, he discovered very little improvement, in spite of their learning gains. How could this be? He discovered that the test designer’s goal was not to create a test that was sensitive to learning, but rather was to rank students, to reproduce the spread of outcomes that we expect. These tests are “insensitive” to a great deal of learning, and of little use in evaluating the quality of instruction. Therefore, when the Gates Foundation (and its myriad sponsored projects) insist that test data be our guiding star, we are often misled.

This is no surprise to teachers. The Gates/Scholastic survey of teachers found that only 28% of teachers see standardized tests as an essential or important gauge of student assessment, and only 26% say they are accurate as a reflection of student knowledge. Another question reveals part of the reason this may be so - only 45% of teachers think their students take these tests seriously, or perform to the best of their ability.

Melinda Gates recently said on Nightline, “An effective teacher in front of a student, that student will make three times the gains in a school year that another student will make.” Math teacher Gary Rubinstein did some digging to figure out that the source of this statistic is a very weak twenty-year-old study by Eric Hanushek, an economist who has also “proven” that money does not matter in educational quality.

Bill Gates himself, on Oprah a few years ago, may also have been channeling Dr. Hanushek when he claimed that if we got rid of all the bad teachers, “our schools would shoot from the bottom (of international rankings) to the top.”

The Gates Foundation has for years been paying for various studies and think tanks that have aggressively promoted the use of merit pay as a means of promoting teacher effectiveness. Even as Florida teachers describe the new evaluation system there as “artificial” and “frustrating,” Gates-funded outfits like the Southern Regional Education Board praise the state for their expanded data systems. The Gates-funded National Council on Teacher Quality is preparing a report that will grade schools of education based in part on their enthusiasm for test data, and the Gates-funded Data Quality Council exerts similar pressure on states to expand their investments in testing and data systems.

However, there seems to be a dawning awareness, among a number of Gates Foundation scholars, that some of the things the Gates Foundation says it wants, such as a trusting collegial environment where collaboration and constructive feedback are the norms, are undermined by a competitive, fear-filled environment where nobody is sure if they will have a job next year.

Here is the problem. Before we can begin to build the kind of positive collaborative culture we need in our schools, we have got to unwind the damage the past decade has wrought on them.

We need to start with an understanding of what teachers can and cannot do. Because this idea that good teachers can eliminate the achievement gap, and bad teachers are to blame for much of it - it is just wrong. In the first place, most research, even that of our friend Dr. Hanushek, shows that the teacher is only responsible for about 20% of the variance in student performance. The lion’s share goes to characteristics the students bring with them - their family’s educational background, income level, neighborhood conditions, health care, and all sorts of issues that are closely related to poverty. This graph, drawn from a 2002 study, gives an idea:

Data for the above chart is from Teacher effects on student achievement, Rowan et al., 2002.

So the starting point for our focus on “teacher effectiveness” is unwise, in a strategic sense. If the differences between teachers account for less than a fifth of the differences we see in student outcomes, then putting all our attention there can’t be seen as a useful effort. On the other hand, if the key difference in student outcomes is due to factors related to poverty and racial and economic isolation, then perhaps we would get better results by focusing there. In fact, the greatest gains made in closing the gap in student achievement between African American and white students took place between 1970 and 1980, when anti-poverty and desegregation programs began to take hold. The past two decades have seen those programs disappear, and our schools have become re-segregated. And the achievement gap has not budged, even as test-based accountability has made closing it its central goal.

But for the sake of this particular discussion, let’s agree that teachers are important and can make a difference in the lives of our students. We want to be the best teachers possible in front of our students, and therefore we should do all we can to make this possible. As Walter Stroup pointed out, teachers have experienced a decade of high-stakes testing that is impervious to actual transformative teaching.

The work that has been done to enhance teacher effectiveness to date has been crippled by the willingness to use standardized test data as an adequate proxy for learning. In fact, we all want much more from our schools than can be measured on a test. Richard Rothstein, in his book “Grading Education: Getting Accountability Right,” points out that we actually want the following things from our schools:


  • Basic academic skills and knowledge
  • Critical thinking and problem solving
  • Appreciation of the arts and literature
  • Preparation for skilled employment
  • Social skills and work ethic
  • Citizenship and community responsibility
  • Physical health
  • Emotional health

Of these, standardized tests only begin to measure the first. When high stakes are attached to these tests, all other goals become devalued. I saw this first-hand in the Oakland schools where the school board two years ago had to pass a directive telling elementary schools that they needed to teach science. Many of the low-income schools were spending two hours plus a day on reading, and another ninety minutes on math, leaving no time for science. The school board passed no such resolution about art or history, so these subjects continue to be left behind. This problem is intensified when we shift into a model where each and every teacher’s scores are used as a significant part of their evaluation.

More recently, Bill Gates has been voicing a bit more caution about the wave of reforms his words and money have spawned. In February of this year, he wrote an op-ed in the New York Times distancing himself from the publication of the value-added scores of individual teachers. He wrote:

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren't a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students' feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

In a speech given a few weeks ago, Mr. Gates publicly backed away from some other policies that his dollars continue to advance. On paying teachers for better test scores, all of a sudden he has no opinion.

Now, let me just say that at this time, we don't have a point of view on the right approach to teacher compensation. We're leaving that for later. In my view, if you pay more for better performance before you have a proven system to measure and improve performance, that pay system won't be fair - and it will trigger a lot of mistrust. So before we get into that, we want to make sure teachers get the feedback they need to keep getting better.

This is progress, I guess. Though I am not quite sure why, if he is aware of the potential harm many of the merit pay schemes are now causing, he does not oppose them, especially since many of them have Gates Foundation fingerprints on them. Furthermore, while he professes no opinion about compensation systems, there is no apparent lack of confidence in their models for evaluation systems, which are even more dependent on an atmosphere of trust. And all these systems are continuing to go forward in legislatures across the country as a direct result of advocacy by various groups working under the Gates Foundation’s sponsorship. All these systems for teacher accountability - is there any accountability for the organizations responsible for these crazy evaluation systems?

There is a whole lot of research piling up that shows how ineffective paying AND evaluating teachers based on test scores has been. There is no logical reason that the pressure we apply in compensation is any different from the pressure of a high stakes evaluation system. If we apply our understanding of pay to the issue of evaluations, perhaps we might get some insight. Daniel Pink says the best way to motivate people with money is to pay them enough to take it off their minds. What if we looked at evaluations in a similar way? How do we make professional growth central? We need to remove the distraction created when high stakes decisions are triggered by unreliable VAM scores. And we need to make sure there are strong due process systems in place so teachers are not in fear that their weaknesses may be used to justify unfair dismissal.

With our students, we know from the research of Paul Black and Dylan Wiliam (Inside the Black Box) that students grow the most when they are focused not on the grades they might be earning, but rather on the quality of their work. How can teachers focus on the quality of their work if they are in a state of constant fear over losing their jobs if their test scores do not rise fast enough? This intense pressure, at the level of school and teacher, is very counterproductive when we want to promote the conditions for growth.

That does not mean nobody should ever be fired. But as I argued here, our schools, especially in high poverty areas, would be far better served by an emphasis on supporting and retaining teachers than they are by a campaign to weed out the poor performers. And if you would like to see a teacher evaluation system truly aligned with reflection and growth, take a look at the report written by Accomplished California Teachers, A Quality Teacher in Every Classroom.

In the framework Bill Gates offered a few weeks ago, he once again discussed the need for “multiple measures.” Ms. Phillips echoes this in her post. Unfortunately, in a high stakes environment such as we now have, adding additional elements to an evaluation system does not make the pressure to focus on test scores disappear. And I do not share Ms. Phillips’ optimism that the new Common Core tests will somehow evolve beyond the limitations evident in every previous generation of standardized test.

The Department of Education followed the basic model recommended by Gates when designing the NCLB waiver guidelines, and has required a “significant” portion of teacher AND principal evaluations to rest on student outcomes - which means test scores. This means in the states that have received those waivers, 40% to 50% of these evaluations rest on test scores. As Bill Gates said: “Test scores have to be part of the evaluation. If you don’t ground evaluations in student achievement, evaluations will conclude that ‘everyone is excellent,’ and that holds teachers back.”

I will let the Gates Foundation folks explain the way they envision multiple measures working. They must understand, however, that their track record thus far does not inspire confidence.

Reformers have tied huge stakes to these evaluations. At the urging of those who have declared getting rid of “bad teachers” job one in our schools, many states have significantly weakened due process for teachers. As a result, faulty evaluations are resulting in career-ending decisions. The test score component is almost always going to be made up of Value Added Model scores, which have been repeatedly demonstrated to be hugely unreliable. The impact of these scores cannot be washed away by including student surveys and classroom observations. And remember that often these observations will be done (or supervised) by principals, whose evaluations likewise now rest in large part on test score gains.

Let’s take a close look at why Value Added Models are now being taken apart by education researchers and mathematicians.

As Linda Darling-Hammond’s research has highlighted, low-income students, special education students, and English learners are consistently harder to raise on the VAM scales. This creates serious disincentives we are already seeing take effect in places that are using VAM - even though it is only one of multiple measures. Teachers will seek to avoid these students. This is reality - not theory.

This graph illustrates the student characteristics of one teacher whose VAM scores showed a significant improvement from one year to the next. This trend is repeated. Given this, who would choose to teach a class such as this teacher had in year one? And how long would they last in a system where a VAM score was 40% or 50% of one’s evaluation?

If you are not convinced of the problems with Value Added Models, please review this essay by mathematician John Ewing, who summarizes the many flaws once again in detail.

These systems are causing good teachers to be misidentified and hounded out of the profession. And worst of all, these systems penalize teachers working with the most vulnerable populations of students, reinforcing their stigmatization, and making it even harder for them to get the teachers they need. This is not only unfair, it is terrible for students.

I want to close this essay with some questions for representatives from the Gates Foundation.

How can teacher effectiveness accomplish the Herculean task society has set for it, of eliminating the achievement gap, when the differences between teachers only account for, at most, about 20% of the variance in student outcomes?

If you have seen the damage done by introducing competitive merit pay schemes, as suggested by Bill Gates’ comments, why have “no opinion” about this? Don’t you have a responsibility to actively reverse some of the damage your advocacy has done?

Is it not reasonable to assume that the damage done by competitive merit pay schemes will also be done by attaching high stakes to evaluations, especially using test-driven VAM systems that have been demonstrated to be highly volatile, based on student ‘data’ which is itself fraught with error, and biased against those teaching the most vulnerable students?

How can we build the foundation of trust we need for more effective evaluations and feedback for teachers when unreliable Value Added systems have already been imposed as a significant part of the process?

If the Gates Foundation honestly believes that the heart of teacher evaluation ought to focus on improvement, not ridding schools of the bottom five percent, then shouldn’t they actively support, rather than undermine, due process protections that allow teachers the security to engage in meaningful reflection, without fear of being fired for their shortcomings?

If the Gates Foundation wishes to reverse the effects of the war that has been so devastatingly waged against the teaching profession, it must first come to terms with the role it has played. Any attempt to dance around the very real damage that has been done invites dismissal by honest teachers. Evaluations that rely in any way on VAM scores are causing great harm to teachers and their students. If the Gates Foundation is unaware of this, after having spent $300 million studying how we can best measure effective teaching, I question its capacity to learn. If the Gates Foundation IS aware of this, given its role in advancing these methods, it is not enough to simply come out with another, more nuanced model - while the old model continues to wreak havoc in our schools. If the Gates Foundation is accountable for its work, it must undo the harm to which it has contributed.

The real “serious work” being done in education is not taking place in think tanks and research facilities. It’s being done in classrooms in communities that are experiencing real and profound trauma. Yes, teacher evaluation ought to be all about reflection and growth. A great start would be to create the conditions that will make that growth possible, and stop obsessing over test scores and measurement systems.

What do you think? How should we look at evidence of learning in teacher evaluation and the work of the Gates Foundation in this arena?

[Editors’ Note: The Bill & Melinda Gates Foundation helps support coverage of business and innovation in Education Week.]

Graphs used with permission.

The opinions expressed in Living in Dialogue are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.