
# Why skoolboy Is Uncertain about the NYC School Progress Reports

It’s election season, which means that we’re being inundated with polls. The reporting of poll results drives statisticians nuts, because the press often reports the percentage of those surveyed who favor one candidate or another, without taking into account the poll’s margin of error. The margin of error is a way of quantifying the uncertainty in the poll numbers, because even a well-designed poll that surveys a random and representative sample of the population is going to generate an *estimate* of the true proportion of those in the population who favor a particular candidate. The general rule of thumb is, the more information available in a sample, the less uncertainty in the estimate. A smaller batch of information will yield a more uncertain, or imprecise, estimate than a larger batch of information. This is as true for estimates of the relative performance of schools and teachers—whether in the form of a complex value-added assessment model or a simple percentage—as it is for political polls.

With apologies to anyone who’s had an introductory statistics course, suppose that we were trying to estimate the average age of the teachers in a very small school (one with only four teachers), but we can only draw a sample of three of the teachers to estimate that average. The four teachers are 25, 30, 30, and 55 years old, and the true average age is (25+30+30+55)/4=35. If our sample was the teachers who are 25, 30 and 30, our estimate of the average age of teachers in the school would be (25+30+30)/3=28.33. If our sample was the teachers who are 30, 30 and 55, our estimate of the average would be (30+30+55)/3=38.33. It’s a simple example, but it shows that different samples drawn from a given population can produce quite different estimates, and that those estimates can be some distance away from the true population value. You wouldn’t want to place too much confidence in a particular estimate if you knew that another, equally valid sample of the same size could generate an estimate that was quite different.
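To make the four-teacher example concrete, here is a minimal Python sketch that enumerates every possible sample of three teachers and computes the estimate each one would give. The variable names are illustrative and not part of the original post.

```python
from itertools import combinations
from statistics import mean

# Ages of the four teachers in the example school.
ages = [25, 30, 30, 55]
population_mean = mean(ages)

# Sample by position so the two 30-year-olds count as distinct teachers.
samples = [
    [ages[i] for i in positions]
    for positions in combinations(range(len(ages)), 3)
]
sample_means = [mean(sample) for sample in samples]

for sample, estimate in zip(samples, sample_means):
    print(sample, round(estimate, 2))

print("true average:", population_mean)
print("range of estimates:", round(min(sample_means), 2), "to", round(max(sample_means), 2))
```

The four possible estimates average out to the true value of 35, but any single sample can miss it by several years, which is the point of the example.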

That same logic applies to estimates of school and teacher performance, such as the New York City School Progress Reports. Most of the elements of the Progress Reports are estimates (for an explanation why, see here), but the calculation of the overall letter grades, which receive so much attention, does not take the uncertainty in these estimates into account. Today, I’ll show that using the 2008 School Progress Reports.

One of the indicators of student progress on the School Progress Reports is the percentage of students who made a year’s worth of progress in English (ELA) and in math from 2007 to 2008. In a given school, each child who was tested in both years can be classified as having made a year’s worth of progress or not, and by totaling up those students who made a year’s worth of progress and dividing by the number of students who were tested in both years, a percentage can be calculated. (There’s an additional wrinkle for students who transferred from one school to another, but it doesn’t affect the logic I’m writing about.)

Each school is compared to a group of 40 peer schools that are judged to be similar based on their demographic and other characteristics. A school’s percentage of children making a year’s progress in ELA is compared to the highest and lowest values in its peer group, and the school gets a peer horizon score that represents its location between the high and low peer group values. For example, if a school had 55% of its students make a year’s progress in ELA, and the percentage for the lowest school in its peer group was 47%, and the percentage for the highest school in its peer group was 71%, the school was located one-third of the way between the lowest and highest schools (8 percentage points above the minimum, out of a possible 24 percentage points above the minimum in the peer group). That peer horizon score of .33 would be multiplied by the 5.625 points that this component is worth in the calculation of the overall letter grade of the school, yielding a net contribution of 1.875 to the school’s overall score.
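The arithmetic in that example can be sketched in a few lines of Python. This follows only the description given here; the function name and constant are illustrative, and the actual Progress Report methodology may include details (such as the transfer-student adjustment) that this sketch ignores.

```python
# Points this component is worth in the overall score, per the description above.
COMPONENT_WEIGHT = 5.625


def peer_horizon_score(school_pct, peer_min_pct, peer_max_pct):
    """Position of a school between the lowest and highest peer values (0 to 1)."""
    return (school_pct - peer_min_pct) / (peer_max_pct - peer_min_pct)


# Worked example: 55% at the school, peer group ranging from 47% to 71%.
score = peer_horizon_score(55, 47, 71)
points = score * COMPONENT_WEIGHT

print(round(score, 3))   # one-third of the way up the peer range
print(round(points, 3))  # contribution to the overall score
```

Note that the score depends entirely on three point estimates (the school's and the two peer extremes), and none of their standard errors enters the calculation.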

The problem is that this calculation doesn’t take into account the fact that all of these percentages are estimates. The chart below looks at one elementary school in particular—Senator John Calandra School (08X014)—and compares it to its peer group of 40 schools. At Calandra, 58.3% of the students made a year’s worth of progress in English in 2008. But the standard error of that percentage is 3.5%, which means that it’s possible that Calandra's true percentage could be anywhere from 51.3% to 65.3%, a wide range. (This range is shown in the “error bars” above and below the estimated percentage for each school.) The same is true for most of the other schools in the peer group. In fact, only two of the 40 schools in the peer group (the ones with the blue markers in the chart) have a percentage that we are confident is higher than Calandra’s percentage. For the other 38 schools in the peer group, we can’t rule out the possibility that Calandra’s percentage is equal to the estimated percentage in those schools. There’s a tremendous amount of overlap among these schools.

And yet Calandra received a peer horizon score of .463, and other schools in the peer group whose percentages of students making a year’s worth of progress in English did not differ statistically from Calandra received peer horizon scores ranging from .169 to .903. Calandra’s peer horizon score of .463 counted for 2.6 out of a possible 5.625 points toward the overall score on the School Progress Report. Other peer schools *whose percentages did not differ significantly from Calandra’s* received from 1.0 to 5.1 points out of a possible 5.625 points on this component of the overall score. Differences of this magnitude could easily make the difference between an overall grade of A and of B, or of B and of C—*just due to chance*. An accountability system such as the New York City School Progress Reports that doesn’t acknowledge the importance of chance and uncertainty is fundamentally misleading the public about its ability to distinguish the relative performance of schools. Some schools are likely doing significantly better than other schools; the problem is that the School Progress Reports don't provide enough information to judge which ones.

Thank you, skoolboy, for explaining so clearly the imprecision of the progress reports!

Just this morning I read a relevant passage in Tocqueville (*Democracy in America*, vol. 1, chapter XIII): "... for when statistics are not based upon computations that are strictly accurate, they mislead instead of guiding aright. The mind is easily imposed upon by the affectation of exactitude which marks even the misstatements of statistics; and it adopts with confidence the errors which are appareled in the forms of mathematical truth."

Bravo, SB. I wonder to what extent principals, teachers, and parents at NYC schools understand the huge random component built into these report cards. The system is so complicated that even I can barely stand to read through all of the documentation, and I doubt that anyone else really gets it either.

If we're going to have report cards (and I'm not sure what the value-added is, honestly), I would take NY State's cardiac surgery report cards as a model, which take into account the uncertainty around the estimates of hospital performance - i.e. a hospital is either labeled a negative or positive outlier, or no different than expected (most hospitals are in that category, and I bet most schools would be, too).

thanks for the post skoolboy. i have a methodological question, probably a simple one.

given the standard error (3.5%) of Calandra's percent making ela progress (58.3%) how did you calculate Calandra's true percentage range (from 51.3% to 65.3%)?

Anon, I've got to be careful how I say this, or the *real* statisticians will be up in arms. As I look back at what I wrote, I fell into the trap I was trying to avoid. It's not really the case that the "true" percentage is between 51.3% and 65.3%. That range is a 95% confidence interval, calculated by taking the sample value of 58.3% plus or minus two times the standard error. There are lots of ways of talking about confidence intervals; what I'll say here is that if the true value of the percentage in the population lies outside of this interval, it is extremely unlikely (less than 1 chance in 20) that we would observe the percentage we see in our sample or a percentage further away from the population value. The values in the confidence interval represent population values for which we might reasonably expect to see our observed value at least 5% of the time if repeated random samples were drawn from the population.

I hope my explanation hasn't made things worse!
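For readers who want to check the arithmetic, here is a minimal Python sketch of the "plus or minus two standard errors" rule described above. The function name is illustrative, and the multiplier of 2 is the rule of thumb used in this reply rather than the exact normal quantile (about 1.96).

```python
def confidence_interval(estimate, standard_error, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- multiplier * SE."""
    margin = multiplier * standard_error
    return estimate - margin, estimate + margin


# Calandra: 58.3% made a year's progress in ELA, with a standard error of 3.5 points.
low, high = confidence_interval(58.3, 3.5)
print(round(low, 1), round(high, 1))
```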

Another methodological question. How do you determine that the two schools have a higher percentage of students making progress than Calandra when the uncertainty margins overlap with Calandra's?

Trevor: The statistical test here is a two-sample test for the equality of two percentages or proportions. It's the difference between the two percentages divided by the standard error of that difference (which is the square root of the sum of the first percentage's standard error squared and the second percentage's standard error squared). When the absolute value of the difference in percentages divided by its standard error is greater than or equal to 2, and the sample size for the two percentages is reasonably large, one would typically reject the hypothesis that the two percentages are equal in favor of an alternative hypothesis that they are different.

In this case, even though the confidence intervals overlap a little bit, the standard error for the 10 percentage point difference (68%-58%) between these two schools and Calandra is about 4.2%, so the ratio of 2.3 exceeds the threshold for concluding that they are significantly different.
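The test described above can be sketched in Python as follows. Calandra's figures (58.3%, standard error 3.5) come from the post; the comparison school's standard error of 2.3 is a hypothetical value chosen so that the standard error of the difference comes out near the 4.2 quoted here, and 68.0 stands in for the rounded 68%. The function names are illustrative.

```python
import math


def se_of_difference(se_a, se_b):
    """Standard error of the difference between two independent percentages."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


def z_ratio(pct_a, se_a, pct_b, se_b):
    """Difference in percentages divided by the standard error of that difference."""
    return (pct_a - pct_b) / se_of_difference(se_a, se_b)


def intervals_overlap(pct_a, se_a, pct_b, se_b, multiplier=2.0):
    """True if the two approximate 95% confidence intervals share any values."""
    low_a, high_a = pct_a - multiplier * se_a, pct_a + multiplier * se_a
    low_b, high_b = pct_b - multiplier * se_b, pct_b + multiplier * se_b
    return low_a <= high_b and low_b <= high_a


calandra_pct, calandra_se = 58.3, 3.5  # from the post
peer_pct, peer_se = 68.0, 2.3          # peer_se is a hypothetical illustration

se_diff = se_of_difference(peer_se, calandra_se)
z = z_ratio(peer_pct, peer_se, calandra_pct, calandra_se)

print("SE of difference:", round(se_diff, 2))
print("ratio:", round(z, 2), "significant:", abs(z) >= 2)
print("intervals overlap:", intervals_overlap(peer_pct, peer_se, calandra_pct, calandra_se))
```

Under these assumed inputs the two confidence intervals overlap and yet the ratio exceeds 2. That happens because the standard error of a difference is smaller than the sum of the two individual standard errors.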

Lord help me if I've made a mistake in this--I'll never hear the end of it.

Thanks skoolboy. But, am I to conclude from this that mapping the uncertainty intervals across different schools does not really tell us much about whether there is any statistical difference between schools' performance? You seem to be saying that we need to do this more sophisticated calculation in order to determine whether a school's performance is statistically distinguishable from others. Is this a fair interpretation? If so, why not report these estimates directly rather than map uncertainty intervals?

Two observations:

One:

MIT professor Walter Lewin begins his introductory physics lecture on measurement by noting:

Any measurement that you make without any knowledge of the uncertainty is meaningless.

I will repeat this.

I want you to hear it tonight at 3:00 when you wake up.

Any measurement that you make without the knowledge of its uncertainty is completely meaningless.

see http://ocw.mit.edu/OcwWeb/Physics/8-01Physics-IFall1999/VideoLectures/detail/embed01.htm

Second, based on the comments by the dedicated, competent, caring principal of my kids' school at Back to School Night, there is no question in my mind that the mechanics of the progress review score calculations are lost on the vast majority of administrators and (for that matter) parents.

Trevor: Different representations of data will be useful for addressing different questions. The confidence intervals show the range of possible population values for particular schools but do not, as you point out, show which schools are statistically distinguishable from other schools. A format that *does* show this is colloquially referred to as a "pantyhose chart," with rows and columns consisting of squares representing a particular unit (states, for the National Assessment of Educational Progress in the USA, or countries in international assessments such as PISA). A square might be grey if the value of the column unit is significantly larger than the value of the row unit, or clear if the values are statistically indistinguishable, or black if the value of the column unit is significantly smaller than the value of the row unit. If the rows and columns are arrayed from high to low, the chart will have the jagged configuration of a pantyhose sizing chart.

A pantyhose chart, however, only indicates which units are statistically distinguishable from one another, and says nothing about either the point estimates for particular units or the uncertainty around those estimates. Thus, what features of the data you wish to highlight will dictate which type of chart will be most illuminating.
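As a rough illustration of how such a chart is built, here is a minimal Python sketch that prints a text version. The four units and their percentages and standard errors are entirely hypothetical, and the symbols `+`, `.` and `-` stand in for the grey, clear and black squares described above.

```python
import math

# Hypothetical units: (name, percentage, standard error). Not real school data.
units = [("A", 70.0, 2.5), ("B", 66.0, 3.0), ("C", 58.0, 3.5), ("D", 50.0, 3.0)]


def compare(col, row, threshold=2.0):
    """'+' if the column unit is significantly higher than the row unit,
    '-' if significantly lower, '.' if statistically indistinguishable."""
    _, pct_col, se_col = col
    _, pct_row, se_row = row
    z = (pct_col - pct_row) / math.sqrt(se_col ** 2 + se_row ** 2)
    if z >= threshold:
        return "+"
    if z <= -threshold:
        return "-"
    return "."


def pantyhose_chart(units):
    """Grid of comparison symbols with rows and columns sorted from high to low."""
    ordered = sorted(units, key=lambda unit: unit[1], reverse=True)
    return ordered, [[compare(col, row) for col in ordered] for row in ordered]


ordered, chart = pantyhose_chart(units)
print("  " + " ".join(name for name, _, _ in ordered))
for (name, _, _), cells in zip(ordered, chart):
    print(name + " " + " ".join(cells))
```

The band of `.` cells around the diagonal is the set of pairs that cannot be told apart, and its jagged edge is what gives the chart its name.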

From Jim Liebman:

Aaron Pallas' recent post is interesting but misleading. The Progress Reports don’t use the outcomes of a sample of students; they average the actual outcomes of all students. PS14’s Progress Report didn't use a sample to estimate the number of students who would score higher this year than last year. It measured the exact proportion of all students who did score better this year. That percentage was not “anywhere from 51.3% to 65.3%.” It was 58.3%.

The Progress Report is less like the polls Mr. Pallas refers to, and more like an election. On Nov. 4th, either Barack Obama or John McCain will be elected President by a plurality in what may be a very close vote. Following Mr. Pallas, the results of the election are an estimate of people’s preferences – and an imperfect estimate at that; if the election were conducted five days in a row, the outcomes might differ by a few percentage points each day and waver back and forth between the candidates. The Nov. 4th outcome might even differ from the average of the five tries – perhaps based on nothing more than the weather in Columbus.

Of course, we can’t wait for a “perfect” vote. We move forward over the next four years by virtue of an imperfect estimate based on a single high-stakes test that we arbitrarily set for the first Tuesday in November each Leap Year. Actual outcomes on actual single occasions have real consequences.

Why elect a President based on a poll that doesn't estimate with certainty who the public truly wants to lead it for four years? The answer is that, following the Constitution, we want a President who, having known from the start what the method of election would be, had every incentive to demonstrate that her or his policies best serve the public's needs. Whoever loses the election, even if by a thin margin, will know that he had every incentive and chance to prove his worth to the electorate and get the votes needed to be elected. 49.3% may not be enough. 50.1% will be.

The goal of our Constitution and election laws isn't to measure perfectly what the public believes on a given day but instead to give candidates and elected officials the strongest incentive possible to address public needs.

The school Mr. Pallas mentions knew for a year exactly what it would take to get an A, B, or another grade. It had every incentive to move its students forward by a known amount in reading and math. 58.3% wasn’t enough. 60% was.

As a recent paper by Pallas' Columbia colleagues Rockoff and Turner (http://www0.gsb.columbia.edu/faculty/jrockoff/rockoff_turner_accountability_8_08.pdf) seems to indicate, the Progress Reports work as intended. They give schools a powerful motivation to move students forward. And schools – especially those whose longitudinal student progress was the lowest last year – responded by finding ways to demonstrably enable more kids to read and do math.

* * * * *

Mr. Pallas may advocate inaction in response to measurement imperfection in evaluating schools, even if not in electing Presidents. If so, he should look at Donald Rubin's paper (with Stuart & Zanutto), “A Potential Outcomes View of Value-Added Assessment in Education,” Journal of Educational and Behavioral Statistics (2004), together with Rockoff & Turner above. According to Rubin, statistical analysis should "focus[] on assessing the effect of implementing reward structures based on value-added models, rather than on assessing the effect of teachers and schools themselves . . . . The real question . . . is, do these descriptive measures, and . . . reward systems based on them, improve education?" Analyzing the effect of DOE's Progress Reports, Rockoff and Turner find that a NYC school's "receipt of a low [Progress Report] grade [in 2007] significantly increased student achievement in both subjects [in 2008], with larger effects in math."

* * * * *

By the way, the NCLB and virtually every other accountability system in the world has attributes similar to the ones Mr. Pallas criticizes in the Progress Reports.

Jim/David,

It is deeply ironic that today, of all days, you are making the argument that uncertainty doesn't matter. Because this morning, the DOE made the argument that it does.

I believe that Liebman's own office had a hand in preparing reports of teacher value-added, which explicitly provide a confidence interval around the estimate of teacher performance. Readers, you can see a sample teacher report here. And the results are quite stunning. A teacher with a value-added percentile of 65 has a confidence interval ranging from the 46th to the 84th percentile. In providing this range, the DOE is formally acknowledging that we do not know if this is a below average, average, or above average teacher.

Why does the DOE think it is appropriate to take into account uncertainty in some cases and not others?

I have to admit that I do not grasp Jim Liebman’s references to electing a President, and how New York City’s School Progress reports are more like an election than a poll. An election is based on “an actual outcome on actual single occasions,” in his words. Is that really what we want from our measures of how a school is performing? A good measure of how the school did—or, for that matter, how a child did—on November 4th, that might not reflect how the school or child was performing on other dates? If a student crams for a test, and does well on the test date, but the next day can’t remember the material examined on the test, would you be happy? I certainly wouldn’t.

I regret that Mr. Liebman either was unfamiliar with or did not follow the link in my post to eduwonkette’s discussion of why sampling error matters even in settings where an entire population (or nearly the entire population) is tested. I won’t repeat eminent psychometrician Dan Koretz’ explanation of why this is largely a settled issue among experts in the field, but I will quote another renowned authority on testing, the late Lee J. Cronbach, and his colleagues Norman Bradburn and Daniel Horvitz:

Mr. Liebman’s final comment is that “NCLB and virtually every other accountability system in the world has attributes similar to the ones” that I criticize in the Progress Reports. Let’s be clear: I’m criticizing the failure of the Progress Reports to address the uncertainty inherent in the measures that are components of the overall score, including sampling error. Let’s not talk about “every other accountability system in the world”; let’s focus on NCLB. Secretary of Education Margaret Spellings’ (2005) document *No Child Left Behind: A Road Map for State Implementation* explicitly allows states to use confidence intervals in the determination of whether a school is making Adequate Yearly Progress (AYP): “Other states may want [sic] employ a statistical test to increase confidence in AYP determinations. In smaller states, or states with many small schools, such a test—a ‘confidence interval’—can help guard against making significant accountability decisions based on fluctuations in school performance or on the assessment results from a relatively small group of students” (p. 8).

The Center on Education Policy’s (2007) report *No Child Left Behind at Five: A Review of Changes to State Accountability Plans* states that between 2004 and 2006, 31 states added or made changes in the use of confidence intervals for calculating a school’s AYP status. This, coupled with the states that had included confidence intervals in their original accountability plans, “means that virtually all states now use confidence intervals in some form,” a conclusion echoed in the Council of Chief State School Officers’ report on the 2007 state educational accountability plan amendments.

One of these states is the state of New York, which I am informed contains the New York City school district which employs Mr. Liebman. New York’s approved accountability plan, available on the U.S. Department of Education website, states, “To minimize the chance that a district or school erroneously will be deemed to have not made adequate yearly progress, New York State’s accountability system uses a ‘confidence interval’ to determine whether a group has met its Annual Measurable Objective (AMO). A confidence interval recognizes the sampling error associated with an observed score and permits the analyst to determine whether the difference between the observed Performance Index (PI) and the AMO falls within certain bounds (that is, within the margin of error attributable to random sampling error) or whether that difference falls outside of the margin of error and is, therefore, not attributable to chance alone.”

So who is out of step here—the New York City Department of Education, which fails to do what virtually every state in the union does, including the state of New York, or me? You be the judge.

Thanks, eduwonkette, for the link to the sample teacher report. Yikes.

In an 8 April 2008 op-ed for the Daily News, Klein wrote:

"A teacher's impact on her students' standardized test scores shouldn't be the only factor used in deciding whether or not to give tenure, and it should never be used as part of a hard formula. Nor should test scores be used without controlling for things like where students start academically, class size and demographics.

"But if these general guidelines are followed, there is nothing wrong - and everything right - with using the achievement gains of students, as measured through standardized tests, to assess the quality of a teacher's work."

By contrast, the recent joint letter from Klein and Weingarten states:

"We wish to be clear on one point: the Teacher Data Reports are not to be used for evaluation purposes. That is, they won’t be used in tenure determinations or the annual rating process. Administrators will be specifically directed accordingly. These reports, instead, are designed to help you pinpoint your own strengths and weaknesses, and empower you, working with your principal and colleagues, to devise strategies to improve."

On the one hand, Klein appears to have changed his mind on the uses of the reports. On the other hand, there is no change at all. Both statements seem to assume that the teacher reports contain accurate information, whether such information be used to "pinpoint" or to evaluate.

Are these reports capable of "pinpointing" our strengths and weaknesses for us? Do they trump our own insights?

I find it fascinating -- and extremely disturbing -- that Jim Liebman, a man who made his academic reputation showing the fallibility of human judgment in the case of those accused of capital crimes should be so convinced of his own infallibility, and that of the extremely unreliable system that he devised to judge schools -- as well as so blind to its negative effects.

One thing that eludes me in all these discussions of the flaws of the NYC school evaluations is that the ELA and Math tests, at least in the elementary grades, are norm referenced tests, not criterion, even though the 1-4 grading scale associates them with rubric scoring. Therefore 50% of students will always have 1 or 2. Doesn't this make the task of labeling and punishing "failing" schools a shell game? And if we're going to talk about uncertainty, what does a score of 1, 2, or 3 really tell us a student knows and can do? All it really tells us is 25% of the population presumably guessed more correct answers than the rest of them.

Alexandra: I would not agree with your characterization of the New York State ELA and Math tests. Because these tests are designed to measure grade-specific learning standards, they are much more akin to criterion-referenced tests than norm-referenced tests, even though the setting of the threshold for proficiency at a given grade level is arbitrary. It's not true that 50% of students will always score 1 or 2 on the grades 3-8 ELA and Math tests. In 2007, more than 50% of students taking the ELA scored at Level 3 or 4 in every grade from 3 to 8.