NOLA Argument #10: Does Playing With the Numbers Change the Outcome?
One of the most common critiques of the New Orleans success story is that the state government, which is largely in charge of the reformed school system, is manipulating data--especially school letter grades--to make the reforms look better than they are.
The root of the problem is that the system's advocates often try to convey the success of the system by referring to the percentage of students in failing schools. The Louisiana Department of Education reports that this percentage has dropped from 62 to 6 percent in New Orleans between 2005 and 2014. This statistic has drawn fire from critics because, it is argued, the state has been changing the scale of the School Performance Score (SPS), altering the scale that translates the SPS into letter grades and, ultimately, playing with the definition of failing schools.
As with most criticisms, this one is half right. On the one hand, the percentage of students in failing schools really is a poor measure of success. On the other hand, the fact that it is a poor measure turns out to be mostly irrelevant when evaluating the reforms.
To get a better handle on the issue, let's take a step back: First, what is a failing school? Currently, it's one that receives an F grade from the state. Where does the F grade come from? From the state's School Performance Score (SPS). Where does the SPS come from? From the percentage of students who reach certain test performance standards like "proficiency" and decisions by policymakers about how important it is for students to meet these various standards. Where do the student performance standards come from? From another set of well-meaning educators, test designers, and policymakers sitting in a room doing their best to define proficiency from the scale scores provided by the test designers. (One example of a scale score is the ACT, which is reported on a scale of 1-36.) The problem is that the decisions that have to be made in translating student scores into school labels are ultimately somewhat arbitrary and often change over time.
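To make the last point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the four schools, their scores and enrollments, the two cutoffs, and the function name are hypothetical and are not Louisiana's actual data or SPS rules. It only shows that the percentage of students in failing schools can move when the cutoff moves, even though no school's score has changed.

```python
def pct_students_in_failing_schools(schools, f_cutoff):
    """Share of students (in percent) enrolled in schools scoring below f_cutoff.

    schools: list of (school_score, enrollment) pairs -- hypothetical values.
    f_cutoff: score below which a school is labeled failing -- hypothetical.
    """
    total = sum(enrollment for _, enrollment in schools)
    failing = sum(enrollment for score, enrollment in schools if score < f_cutoff)
    return 100.0 * failing / total


# Made-up example: (school score, enrollment). Not real data.
toy_schools = [(45.0, 400), (55.0, 300), (70.0, 500), (85.0, 300)]

# Same schools, same scores, two different (made-up) definitions of "failing".
for cutoff in (50.0, 60.0):
    share = pct_students_in_failing_schools(toy_schools, cutoff)
    print(f"cutoff {cutoff:.0f}: {share:.1f}% of students in failing schools")
```

In this toy example the share rises from about 27 percent to about 47 percent purely because the definition of failing changed, which is the kind of sensitivity critics are pointing to.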
So, why not just bypass all these decisions and go straight to the source--why not just judge school success based on student scale scores? That's exactly what we did when we estimated the effect of the reforms on student outcomes. Looking at the data this way, we see a large improvement in student achievement, in the range of 0.2-0.4 standard deviations, or 8-15 percentile points; that is, a student who would otherwise have been at the 50th percentile would instead be at roughly the 58th-65th percentile. This shows that there is no reason to rely on the percentage of students in failing schools when trying to understand the effects of the reforms.
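For readers who want to see where the percentile figures come from, here is a small sketch of the conversion. It assumes test scores are approximately normally distributed, which is a common simplification and my assumption here rather than something established above; the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_after_gain(effect_sd, start_percentile=50.0):
    """Percentile reached by a student who starts at start_percentile and
    gains effect_sd standard deviations, assuming normally distributed scores."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_sd)


# The effect range reported in the text: 0.2 to 0.4 standard deviations.
for effect in (0.2, 0.4):
    print(f"{effect} SD from the 50th percentile -> "
          f"{percentile_after_gain(effect):.1f}th percentile")
```

Under that assumption, a gain of 0.2 standard deviations moves a median student to about the 58th percentile and a gain of 0.4 to about the 65th or 66th, in line with the range quoted above. The same effect translates into fewer percentile points for students who start far from the middle of the distribution.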
But this wording also shows why it's so tempting to report results based on the number of students in failing schools. Our approach to the analysis, using scale scores, forces us to talk about the reform effects in unfamiliar terminology. "Standard deviations" are used by academic researchers like me, but by few others. Percentile rankings are better, and I think they are probably the best compromise between simplicity and validity. (Some have suggested that we translate these into "days of learning" as some researchers do, but that metric is also misleading--it implies that achievement is driven by the number of days students are in school, which is far from the case.)
This is partly why the conversation often turns back to the percentage of students in failing schools--this way of talking about the effects is simple and intuitive. Or, more precisely, it masks the actual complexity by relying on an ambiguous and evolving definition of "failing." We simplify things like this all the time in everyday life, often for good reason. When I drop my daughter off at school, I say, "Have a great day," not "Have a day that's two standard deviations above the average."
When is it OK to simplify this way? That is, when should we use letter grades and failing grades and when should we use the more technically correct approach? Certainly there is a time and place for each. Goal-setting is one place where the simpler labels fit. Take New Schools for New Orleans (NSNO), which has incubated charter schools in the city and provides various forms of school support. They often talk about the goal of creating "50,000 High-Quality Seats," meaning that every student should have a place in a good school. This is a laudable and ambitious goal. I think in that situation, simplification makes sense.
We also need to define failure (and success) for the purpose of determining whether and how a state or school district should make changes to ensure that all children are learning. This makes much more sense when the failing label comes from a good performance measure, which few states actually have. Louisiana's is not much better, as I wrote in a report to the Louisiana Board of Elementary and Secondary Education. NSNO's definition of a "high-quality seat" also addresses the same problem.
But good performance measure or not, there will be times when those labels are less useful. In this case, if we are trying to evaluate the success of the school system, then using the percentage of students in failing schools exaggerates the reform effect: the drop from 62 to 6 percent makes it sound as if the system is 10 times better than before, and I don't think anyone really believes that.
On the other hand, the effects do seem very large. As we reported in our article in the academic journal Education Next, the New Orleans school reforms appear to have been more successful in raising test scores than almost any common alternative program or policy for which there is good evidence.
If that isn't intuitive enough, then simply reporting the changes in the percentage of students at various performance standards would be an improvement. Almost anything is better than the percentage of students in failing schools.
In this case, relying on the failing label seems unnecessary and opens advocates up to avoidable criticism. They pay a price for that. In fact, we all pay a price because it sidetracks the debate from what we should really be talking about--the actual strengths and weaknesses of the school system and how to address the weaknesses.
Does the reference to percent of students in failing schools exaggerate the effects of the reforms? In this case, yes. Should advocates stop measuring the success of the system this way? Also, yes. But does this change the general conclusion that the reforms significantly increased student achievement? No. When we bypass the letter grades, we still see large positive effects.