Evaluation: Just Like a Blender?
Note: Andrew McEachin, an assistant professor of educational policy analysis and program evaluation at NC State University, is guest posting this week.
I promised that I would spend today talking about promising teacher evaluation models, but I got sidetracked by a recent New York Times article on the Obama administration's desire to create a college rating system. In the article, Jamienne Studley, a deputy secretary at the U.S. Department of Education, is quoted as saying rating the U.S. college system is like "rating a blender ... This is not so hard to get your mind around." The comparison of rating a dynamic social system to a kitchen appliance is a gross oversimplification of how difficult it is to build a valid, reliable, and trustworthy public accountability model. It is also a stark reminder of the disconnect between what researchers know about the design and impact of accountability systems, and what is implemented by federal, state, and local policymakers. If creating sound policy were as easy as Studley claims, why do we consistently see policies adopted that fly in the face of decades of social science research? I will instead spend today discussing a few of the largest gaps between the research and policy communities with respect to teacher and school accountability policies.
Don't get me wrong, part of the problem is that the research community does not do a good enough job disseminating our work to the greater education stakeholder community (I applaud organizations like Scholars Strategy Network and Policy Analysis for California Education in their efforts to close this gap). With that said, Studley's comment, and many of the accountability policies implemented to date, remind me of a paper by Professor Steven Kerr titled On the Folly of Rewarding A, While Hoping for B. The issues below are prime examples of hoping for one outcome, but due to design and implementation issues, measuring another.
The education production process is multi-dimensional
To be an effective educator, regardless of how "effective" is defined, teachers have to take on a number of different roles, each with its own requisite skills, knowledge, and duties. However, when it comes time to hold teachers accountable for their students' outcomes, we typically only measure a few of the dimensions of teacher quality. As Kerr noted, it is silly to think that holding teachers accountable for a few objective measures of student performance will get them to improve in all areas of teacher quality. A recent paper by Kirabo Jackson showed that teachers have a profound effect on students' non-cognitive outcomes, and these non-cognitive outcomes are important predictors of success in postsecondary education and the labor market. If we continue to narrowly define teacher quality, there is little reason to believe that teachers will improve along a number of important, but unmeasured, dimensions.
Parents and other stakeholders have different values and expectations for schools and student outcomes
Just as the education production process is multi-dimensional, so too is the list of desired knowledge, skills, and traits we expect students to acquire in our schools. In a seminal article, Professor David Labaree outlines the competing, although not mutually exclusive, philosophies/goals that guide what we expect out of our schools. More recently, in a string of studies, colleagues from Michigan State have evaluated how parent and other stakeholder satisfaction is influenced by the design of an accountability model, and how the backgrounds of the stakeholders influence their schema of an ideal school (here and here). This is all to say that once you factor in the multi-dimensionality from the production side (teachers, principals, etc.), and the consumer side (parents, the general public, etc.), the process becomes much more difficult than whether my blender makes a delicious smoothie.
Proficiency rates are noisy and tightly linked to student demographics...yet we still use them
One of the main complaints with the NCLB-style accountability model is its reliance on holding schools accountable for students' proficiency rates on math and reading assessments. These measures are strongly correlated with the types of students schools serve, predictably over-identifying schools with larger shares of minority or low-income students as failing. A second problem is the use of year-to-year changes in proficiency rates as a measure of school improvement. These year-to-year changes are rife with statistical noise, akin to flipping a coin to decide a school's fate. And yet, the ESEA waivers still require states to use proficiency rates as a measure of school success (here). In most states, even those that use growth measures, proficiency rates make up a preponderance of a school's accountability grade, index, or other summative rating. The use of proficiency rates as the main criterion by which we define school success will continue to penalize the schools that need the most support.
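To make the coin-flip point concrete, here is a minimal simulation sketch. It rests on assumptions that are mine, not drawn from any state's data: a school whose true proficiency rate never changes (60 percent), students who are independent draws from that rate, and a 5-percentage-point swing treated as "meaningful." The function names and all numbers are hypothetical and purely illustrative.

```python
import random

def observed_rate(true_rate, n_students, rng):
    """Share of one tested cohort scoring proficient, given a fixed true rate."""
    proficient = sum(1 for _ in range(n_students) if rng.random() < true_rate)
    return proficient / n_students

def share_of_large_swings(true_rate, n_students, threshold=0.05, trials=2000, seed=0):
    """Fraction of simulated year pairs whose observed rates differ by at
    least `threshold`, even though the school's true rate is unchanged."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    swings = 0
    for _ in range(trials):
        year1 = observed_rate(true_rate, n_students, rng)
        year2 = observed_rate(true_rate, n_students, rng)
        if abs(year2 - year1) >= threshold:
            swings += 1
    return swings / trials

# Hypothetical schools: 50 vs. 500 tested students, same unchanging quality.
small_school = share_of_large_swings(0.60, 50)
large_school = share_of_large_swings(0.60, 500)
print(f"small school: {small_school:.2f}, large school: {large_school:.2f}")
```

Under these assumptions, the small school shows an apparent gain or loss of five points or more far more often than the large one, despite nothing about the school having changed. Real data add further sources of noise (cohort differences, test changes) that this sketch leaves out.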
Context is important, but we don't want to do anything about it
It is a well-known fact that roughly three-quarters of a student's math or reading test score is driven by factors that occur outside of a school. Therefore, if you want to get at the true impact of a teacher or school on a student's outcomes, you need to account for these non-school related factors. The NCLB-style proficiency rate system did not account for these factors. The ESEA waiver program allows states to implement models based on changes, or growth, in student achievement in an effort to home in on the true impact of teachers and schools on students. However, the type of growth models allowed by the feds does not account for a majority of the contextual differences among schools. In fact, the growth models (e.g., Student Growth Percentiles) are only marginally less correlated with student demographics than proficiency rates (see here and here). We therefore shouldn't expect to see any major differences between the NCLB era and the ESEA waiver era in terms of which schools are labeled as failing (more on this on Friday).
More years of data are better
While you might be thinking that this is obvious, it doesn't appear to be to policymakers. We have known for a long time that using multiple years of data to generate teacher and school performance measures is more reliable than using data from a single year (here and here). However, even with the opportunity to construct a new accountability system under the ESEA waiver process, most states are still relying on just one year of data to place schools in performance categories. Given that most states have collected at least school-level performance data for the better part of the past 10 years, it is reasonable to assume that states could readily implement longer periods of time in their accountability systems.
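The arithmetic behind this is simple enough to sketch. Assuming, as a simplification of my own, that each student's result is an independent draw with the same chance of scoring proficient, the standard error of a school's proficiency rate shrinks with the square root of the number of students pooled. The function name and the numbers below are hypothetical.

```python
import math

def proficiency_rate_se(true_rate: float, students_per_year: int, years: int = 1) -> float:
    """Standard error of a proficiency rate pooled over `years` cohorts.

    Assumes independent students who share one probability (`true_rate`)
    of scoring proficient; real cohorts will not be this tidy.
    """
    total_students = students_per_year * years
    return math.sqrt(true_rate * (1 - true_rate) / total_students)

# Illustrative school: 100 tested students per year, 60 percent truly proficient.
one_year = proficiency_rate_se(0.60, 100, years=1)
three_year = proficiency_rate_se(0.60, 100, years=3)
print(f"1-year SE: {one_year:.3f}, 3-year SE: {three_year:.3f}")
```

Pooling three years cuts the standard error by a factor of the square root of three in this setup, which is why a multi-year measure is less likely to move a school across a performance cutoff by chance alone.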
I think Studley's analogy is more apropos of the current policy landscape: Let's throw some ideas into a blender, mix some stuff around, and see what sticks. Instead of implementing more of the same, I am, perhaps naively, optimistic that there are opportunities for researchers, policymakers, practitioners, parents, and other stakeholders to come together to design an accountability model that is valid, fair, reliable, and trustworthy. The research community is just starting to understand how school-level accountability policies impact students, teachers, and so on, and now we're talking about holding universities and teacher programs accountable. For all the issues listed above, there is a positive role that accountability policies can play in the future of U.S. public education. We just need to slow down, work together, and stop pretending this stuff is easy. Tomorrow I will discuss a few ideas for the design of teacher and school accountability that I find most compelling.