Duncan's Hammer: Test Scores
It has been observed that when the only tool you have is a hammer, every problem looks like a nail...
And that the hammer shapes the hand...
"If I had a hammer, I'd hammer in the morning, I'd hammer in the evening, all over this land."
Of course, that last example, in the lyrics by Pete Seeger and Lee Hays, turns out to be a hammer of justice. Arne Duncan has a hammer, and it's a hammer of testing. The basic premise of so many policies from the Education Department is that test scores are important and threats are an appropriate motivation to raise test scores. If you (teacher, school, district) don't get those test scores up, the hammer comes down.
I'm simplifying the Obama-Duncan policy agenda a bit, but in almost everything they've proposed we find the use of test scores as a measure of success, and as the basis upon which to employ threats as the lever of change. Of course, they don't say "test scores" every time they refer to test scores. The preferable terms are learning, growth, outcomes, and achievement. Because who could be against that?
The latest problem that the Education Department has decided to tackle in part through testing data is the uneven quality of teacher preparation. I don't mind at all that the Department is suggesting the need to improve in this area. I'm entirely in favor of examining and improving the entire spectrum of teacher professional development, from training and induction, to evaluation, ongoing professional learning, and differentiated career pathways. Those are topics I've studied along with fellow teachers from around California, and we published our recommendations in a pair of reports in 2010 and 2012.
The Education Department proposal (press release) does suggest the collection of some information that might be useful, such as data about teacher retention, and the satisfaction of both teachers and the employing schools or districts. The press release also cites a number of innovative and promising approaches already in use for improving teacher preparation around the country. But then we see the typical Duncan flaw in the policy: testing data included where they don't belong, and the threat of reduced funding as a potential consequence for low scores.
Now, to be clear, the proposed rules from the Education Department do not technically require comparisons of teacher preparation programs through the use of test scores, but rather call for examination of "student learning outcomes" - and please add a comment below if you can envision how schools, districts, or states will end up using anything other than test scores to comply with that policy.
My critique is not at all a defense of the status quo; if anything, proposing the use of test scores for myriad purposes beyond the intent of their design is becoming the status quo. I've written frequently about the problems of using test scores for teacher evaluation. There's simply too much going on in a classroom, in a school, and in a student's mind, to make valid inferences about teaching based on highly limited standardized assessments, which weren't designed as measures of teaching in the first place. (See, for example, Haertel; Popham; EPI).
Now, in government and think-tank offices, or in the halls of universities, these uses of testing data might sound reasonable, and experts on both sides can debate their supposed ability to control for this factor and that factor, run the data through this model or that model, find standard deviations, express the effects in decimals or as "extra months of learning." But on the ground, their modeling looks more like muddling, and in fact, the evolution of teaching practices is working squarely against their efforts to isolate teacher effects.
For starters, consider the contrasting situations for elementary school teachers I've visited in the past few months. One teacher works at a school where every class has 20 or fewer students, pull-out reading and math support is available to the neediest children multiple days per week, and grade-level collaboration happens regularly among teachers who all have years of experience working together. However, this school has not invested in much digital technology, and students spend little time using computers or keyboards. At another school, the teacher has a larger class and much less support. However, since it's a magnet school, the class has consistent time spent in science labs, meaning that the students have different types of opportunities to work on literacy and math skills - with a different teacher. And when they rotate through the computer science lab, they have more practice with keyboarding as well, which comes in handy on new Common Core assessments.
And one of these teachers works in Los Angeles Unified School District, meaning that she has lost hours of work time this year mitigating the problems of a failed district-wide student record-keeping system. That problem didn't exist last year, and hopefully won't exist next year. If only there were a good study out there somewhere to provide the appropriate mathematical model to control for single-year MiSiS-meltdown variables in value-added measures.
Now tell me again how test scores from these students will reflect the quality of their teachers' training and preparation.
Supporters of Duncan's proposal will point out that I'm using individual examples, which is not the intent of the policy, and that once there are large enough sample sizes, the effects of these particular variables will even out. If there were random distribution of a program's graduating teachers among districts and schools, random distribution of teachers from different programs within schools and districts, and if supposedly "comparable" classrooms, schools, and districts were really comparable in practice, and if practices, policies, and assessments were consistent from year to year across settings - then I might be more open to the potential of such comparisons. But that's not the case. And that's only elementary school.
At the secondary school level, we have to write off a few subject areas immediately because their students won't produce any test scores that can reasonably be linked to the curriculum. Even if we focused on the teachers in math and English language arts, we'd still face the significant differences in district and school induction and support, differences in collaboration, differences in curriculum. And secondary schools are evolving in a variety of ways that will be good for students but bad for linking tests to teachers. Some schools teach a year's worth of content in a semester by doubling instructional time per week and cutting in half the number of classes a student studies per semester. (If your math class is first semester and your Common Core test comes second semester, you have the advantage of having finished the class but the disadvantage of not having studied math recently. If your math class is second semester, math practice is fresh in your mind but you haven't finished the course yet.) Some schools have gone so interdisciplinary that students don't have an English class with an English teacher, but rather a Humanities class with two teachers. School-level specializations such as arts and STEM, school-level decisions about project-based learning or traditional instruction, the size of the school, style of collaboration, and stability of leadership all contribute to a teacher's effectiveness, and no one knows at what point the effects of pre-teaching experience begin to fade relative to the effects of the school setting.
Once again, we see Duncan and the Education Department identifying an actual problem and coming up with the wrong approach to solving it. I'm hardly surprised to be disappointed; they're ready to impose high-stakes consequences before seeing the potential analyses of non-existent data gathered from new and unproven assessments designed for purposes unrelated to the policy matter at hand.
If you have any thoughts you'd like to share with the Education Department about this proposed policy, they're accepting public comments until February 2, 2015.