
Come on Feel the Noise!

Last week, New Yorkers scratched their heads and tried to make sense of the Progress Report results. What does it mean, for example, when 77% of schools that received an F last year jump to an A or a B? Michael Bloomberg has a resolute answer to this question: “Not a single school failed again....The fact of the matter is it’s working.”

Last week, skoolboy and I took to our computers with the newly released data. Of particular concern is the progress measure, which makes up 60% of a school’s grade. Both skoolboy and Dan Koretz have already identified serious flaws in the DOE’s test progress model. But even in the absence of those problems, any model of year-to-year growth must contend with the measurement error present in two different tests.

What the heck is measurement error? Bear with us for two paragraphs, because this is critical to understanding the central problem with the Progress Reports. A test score is just a proxy for students' underlying skills and competencies. If you give a student a test, the test score represents the combination of her "true" level of skills plus measurement error. This error may be a function of idiosyncratic factors like not eating breakfast (which might hurt your score), the good fortune of having studied the material that happens to be on the test (which would push your score above your true level of skill), or a dog barking during the test (which might decrease the scores of all students in a classroom). A "gain score" is the difference between two test scores, both of which are measured with error, so it is an especially noisy estimate.

If measurement error were constant, it would simply cancel out when we take the difference between the two scores. But measurement error is random, so the two errors do not cancel; they compound. Another kind of error stems from sampling variation, which I have discussed here before. In short, the more measurement error (or “noise”) in the results, the harder it is to detect the “signal” that represents a school’s actual contribution to growth in student learning.
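To make this concrete, here is a toy sketch (purely illustrative numbers, not actual test data) of how the errors in two scores compound when you take their difference. It assumes the textbook model above: observed score equals true skill plus random error, with the two years' errors independent of each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000

true_skill = rng.normal(650, 30, n_students)   # each student's "true" level of skill
error_2007 = rng.normal(0, 15, n_students)     # measurement error on the 2007 test
error_2008 = rng.normal(0, 15, n_students)     # measurement error on the 2008 test

score_2007 = true_skill + error_2007
score_2008 = true_skill + error_2008           # no true growth at all in this toy example
gain = score_2008 - score_2007                 # the "gain score" differences two noisy measures

print(round(np.std(error_2007), 1))  # ~15: the error in a single score
print(round(np.std(gain), 1))        # ~21, i.e. sqrt(15^2 + 15^2): the errors add, they don't cancel
```

Every student's true gain in this sketch is zero, yet the observed gains swing by roughly 20 points in either direction.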

In what follows, we demonstrate that there is almost no relationship between NYC schools' progress scores in 2007 and 2008. The progress measure, it appears, is a fruitless exercise in measuring error rather than the value that schools themselves add to students. If we believe that the Progress Reports are in the business of cleanly identifying schools that consistently produce more or less progress, this finding is rather troublesome.

First, some sunnier results: below, we provide scatterplots of the relationship between the overall environment and performance-level scores in 2007 and 2008 for the 566 elementary schools that received overall grades in both years. In both cases, last year’s score is a strong predictor of this year’s score. To quantify the extent to which two variables move together, we can use a measure called a correlation coefficient. A correlation of 0 means the variables have no relationship, while a correlation of 1 represents a perfect positive relationship. We find that the correlation is .82 for the performance score and .75 for the environment score. This is exactly what we would expect: schools’ performance and climates do not change wildly from year to year.
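For readers who want to see the mechanics, here is a tiny example of computing a correlation coefficient for five hypothetical schools (the numbers are invented purely for illustration):

```python
import numpy as np

score_2007 = np.array([45.0, 60.0, 72.0, 80.0, 55.0])  # hypothetical 2007 scores for five schools
score_2008 = np.array([48.0, 58.0, 75.0, 83.0, 52.0])  # hypothetical 2008 scores for the same schools

r = np.corrcoef(score_2007, score_2008)[0, 1]
print(round(r, 2))  # close to 1: last year's score strongly predicts this year's
```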


But the relationship between the 2007 and 2008 progress scores is quite different: the correlation is -.02. In other words, there is almost no relationship! This is precisely what we would expect to see if the growth measures were primarily capturing measurement error. (The correlations for K-8 and middle schools are slightly larger but still low: .11 and .15, respectively.)
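A quick simulation illustrates why this pattern points to noise. Under our own illustrative assumptions (stable school quality, a small true value-added signal, and sizable year-specific error; none of these numbers come from DOE data), level scores stay strongly correlated across years while "progress" scores correlate near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools = 566  # the number of elementary schools graded in both years

school_quality = rng.normal(0, 1, n_schools)                 # persistent school performance
perf_2007 = school_quality + rng.normal(0, 0.3, n_schools)   # level score, 2007
perf_2008 = school_quality + rng.normal(0, 0.3, n_schools)   # level score, 2008

true_progress = rng.normal(0, 0.1, n_schools)                # small, stable value-added signal
prog_2007 = true_progress + rng.normal(0, 0.5, n_schools)    # progress measure, 2007
prog_2008 = true_progress + rng.normal(0, 0.5, n_schools)    # progress measure, 2008

print(round(np.corrcoef(perf_2007, perf_2008)[0, 1], 2))  # ~.9: levels are stable year to year
print(round(np.corrcoef(prog_2007, prog_2008)[0, 1], 2))  # ~0: progress looks like noise
```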


We are left with three possible explanations:
1) The poorly constructed progress measure is simply measuring noise.

2) The DOE somewhat tweaked the progress measure for this year, so the results are not comparable.

3) The receipt of and publicity around last year’s progress measures fundamentally changed how New York City’s elementary schools do business, so that schools that were more successful in raising student achievement in 2007 suddenly became less so, and schools that were less successful in raising student achievement in 2007 suddenly became more so.

New Yorkers are left with three courses of action:
* If explanation 1 is correct, we should ignore these report cards altogether because they are primarily (60%) measuring error.

* If explanation 2 is correct, we should not compare schools' grades in 2007 with their grades in 2008, because they are measuring fundamentally different dimensions of school performance. In this case, the collective hysteria that gripped NYC schools last week over why grades went up or down was all for naught.

* And if explanation 3 is correct, eduwonkette and skoolboy should shut up and get out of the way of the silent revolution that has transformed public schooling in New York City.

Thanks to skoolboy’s masterful analysis of the data, we present evidence below the fold suggesting that the likely culprit is measurement error. The evidence is not conclusive, because every single element of the progress measure—and there are 16 of them in this year’s student progress measure—changed slightly from last year to this year. The strategy we pursue below is to compare those elements of the progress measure that were used in both years - for example, the percentage of students making at least one year of progress, or the average change in proficiency scores. Again, we stress that these measures were not identical across years, but one would expect them to be moderately related. Needless to say, that is not what we found. Given the analyses described in detail below, we think it extremely unlikely that this is simply due to a tweaking of the progress report measures.

And what of the third explanation—a fundamental overhaul in the effectiveness of New York City’s elementary and middle schools over the past year that reshuffled the effective and ineffective schools? Magical transformations that shift schools from low to high-progress, or vice versa, are the fabled stuff of Hollywood movies, not reality. Real school change, unfortunately, is not an overnight affair.

Where does this leave NYC parents, teachers, and principals, all of whom are trying to make sense of what these measures mean? Bottom line: It's impossible to know what your A or your F means, because these grades are dominated by random error. Let's hope that the DOE heads back to the drawing board rather than continuing to defend the indefensible.

A key measure in both last year’s and this year’s student progress measure is the percentage of students making at least one year of progress in ELA and in Math, where a year of progress is defined as attaining the same or higher proficiency rating in 2008 in the subject as the student received in 2007, with a minimum proficiency rating of 2.00 in 2008. Three changes to this measure are new this year: (a) if a student scored at Level IV in both 2007 and 2008, that student is counted as making one year of progress, even if the proficiency rating declined from 2007 to 2008; (b) all students who were designated Special Education in 2007 receive a +0.2 addition to their 2007 proficiency rating before calculating whether a year of progress was achieved; and (c) any middle school student earning an 85 or higher on the Math A or Integrated Algebra Regents exam is automatically classified as making one year of progress in Math.
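Here is a rough sketch of that rule in code. The thresholds come straight from the description above; the function name, arguments, and data layout are our own, and we are reading the Special Education adjustment literally as written (0.2 added to the 2007 rating):

```python
def made_year_of_progress(prof_2007, prof_2008, level_2007, level_2008,
                          special_ed_2007=False, regents_math_score=None):
    # (c) Middle schoolers scoring 85+ on Math A or Integrated Algebra count automatically (Math only).
    if regents_math_score is not None and regents_math_score >= 85:
        return True
    # (a) Level IV in both years counts as a year of progress, even if the rating declined.
    if level_2007 == 4 and level_2008 == 4:
        return True
    # (b) Students designated Special Education in 2007 get +0.2 added to their 2007 rating,
    #     as described in the post, before the comparison is made.
    if special_ed_2007:
        prof_2007 += 0.2
    # Base rule: same or higher proficiency rating in 2008, with a floor of 2.00 in 2008.
    return prof_2008 >= prof_2007 and prof_2008 >= 2.00
```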

For elementary schools, the correlation between the peer horizon score for the percentage of students making at least one year of progress in ELA in 2007 and in 2008 is -.10, and the correlation for the citywide horizon score over the two years is -.09. There is essentially no stability over time in which elementary schools were successful in advancing their students a year in ELA achievement. The story is even more surprising at the K-8 and middle school levels; the K-8 peer horizon correlation is -.15, and citywide horizon correlation is -.16, whereas the middle school peer horizon correlation is -.24, and citywide horizon correlation is .01.

The stability in a school’s ability to advance its students a year of progress in Math in 2007 and 2008 is a bit higher, especially at the middle school level. For elementary schools, the correlation of the peer horizon score in 2007 and 2008 is .09, and for the citywide horizon score it’s .16. Among K-8 schools, the peer horizon score correlates -.03, and the citywide horizon score correlates .11. The greatest stability is seen at the middle school level, where the over-time correlation for the Math peer horizon score is .33, and for the citywide horizon score, .32.

We did the same kind of over-time calculation for the average change in proficiency scores from 2007 to 2008, which also involved the Special Education adjustment in 2008. Five of the six correlations for the average change in ELA proficiency, which range from -.16 to -.37, are negative and statistically significant. What this means is that the schools that were judged to be more effective in raising students’ ELA proficiency in the 2007 report card were significantly less successful in producing ELA gains in 2008 than the schools that were less effective in 2007.

At best, there is no correlation over time in the DOE’s reports of which schools are good at inducing growth in ELA achievement. At worst, the DOE’s system finds that the schools that were better than average in 2007 were actually worse than average in 2008.


Really great reasoning: It can't be reason 3 because, uh, schools can't improve. Which we'll show you by telling you this accountability system doesn't work. We know this because, uh, it just can't!

Maybe they'll make a movie about the NYC DOE one day.

Hi Socrates,

Yes, the movie will be called "The Smartest Guys in the Room, the Sequel."

If schools did improve as dramatically as we posit in explanation 3, we should also see a very weak correlation on the overall performance measure. We don't: the correlation between the 2007 and 2008 performance scores is .82.

Imagine what we could do if the DOE spent all that time, energy and MONEY on working in smart ways (by trimming the fat, listening to teachers and thinking of ways to be honest about the progress and measures we use) rather than thinking of ways to defend itself. Because their idea sucks.

Can you imagine if a lowly employee of the DOE (e.g. a teacher....gasp!) spent all that time defending her poorly administered assessments? Or, if they had their way, we spent ALL our time assessing, making graphs and defending said graphs? No, you can't because then NOTHING WOULD GET DONE.

Hmmmm....maybe the guys at the top could learn something from those of us at the bottom.

You state:

"What this means is that the schools that were judged to be more effective in raising students’ ELA proficiency in the 2007 report card were significantly less successful in producing ELA gains in 2008 than the schools that were less effective in 2007."

Could this be due to (a) a ceiling effect or (b) the psychometric properties of the test?

If there is a ceiling effect, then the high performers will be constrained in the second year. I don't know enough about the test to know whether this is possibly the case.

I have seen some tests that are constructed in such a way that getting one additional question correct at the lower end of the distribution of scores translates into a greater scale score gain than getting one additional question correct at the high end of the distribution. This could cause some funny things to happen as well.

QUESTION: In your graph, what are the progress category scores? An increase or decrease in proficiency levels? Maybe you described this, but I missed it. Please explain.

We don't think this is due to a ceiling effect, because the DOE explicitly tried to adjust for possible ceiling effects by defining any student who was at Level IV in both 2007 and 2008 as having made a year of progress, even if that student's scale score in 2008 was lower than his/her scale score in 2007.

Dan Koretz' and my posts last week speak to the flaws in the DOE's process of converting scale scores to proficiency scores, but I don't think that the conversion of raw scores to scale scores is the culprit here. It's true that very high and very low scale scores are less reliable than scores in the middle of the distribution.

It's hard to explain the calculation of the progress scores succinctly. Here goes: They are a weighted average of the relative position of a school's progress in relation to the minimum and maximum progress observed in a group of 40 similar peer schools at the same grade level, and the relative position of a school's progress in relation to the minimum and maximum progress observed among schools at the same grade level citywide. A school whose student progress was higher than all of the 40 schools in its peer group would receive a progress score for that measure of about 1; a school whose student progress was lower than all of the 40 schools in its peer group would receive a progress score for that measure of about 0.
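In code, a rough sketch of that calculation might look like the following; the 75/25 peer/citywide weighting below is our own assumption for illustration, not a number taken from the DOE's methodology:

```python
def horizon_score(value, group_values):
    # Relative position between the group's minimum (0) and maximum (1), clipped to [0, 1].
    lo, hi = min(group_values), max(group_values)
    if hi == lo:
        return 0.5
    return min(max((value - lo) / (hi - lo), 0.0), 1.0)

def progress_score(value, peer_values, citywide_values, peer_weight=0.75):
    peer = horizon_score(value, peer_values)          # position among ~40 similar peer schools
    citywide = horizon_score(value, citywide_values)  # position among all schools at that grade level
    return peer_weight * peer + (1 - peer_weight) * citywide
```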

If that sounds confusing, it should be.

Eduwonkette and Skoolboy--

You rock! If there were public service awards in public education, I would nominate you both. Eduwonkette: you have earned your cape....Skoolboy: go to the head of the class.

Seriously folks, we have to keep the pressure on the DOE to live up to their commitment to data-driven decision making. First, they need better data than the flawed state tests; second, they need better data analysis systems.

Question: when is someone going to sue NYS for tests that are not reliable or valid?

Interesting analysis. I'm not sure I understand why it measures error--crunching numbers is not my forte. But I will say from personal experience (as a principal) that the progress reports did put a lot of pressure on the lowest schools and at least at our school, we targeted specific areas that changed our score pretty dramatically.

I'm not defending the progress reports--I think they paint a very limited picture of a school, especially in the lower grades. But I don't think the DOE is wrong in its efforts to try to develop tools to quantify what schools are doing. Certainly, I know there are LOTS of schools that have run for years with a good reputation but that haven't actually moved kids very much at all. How do you uproot that?

Do you have ideas about better ways to measure academic progress in a way that rewards schools that don't just cherry-pick the "best" kids and then congratulate themselves on high test scores?

Hi Janie,
I definitely have no problem with quantification, and I also applaud the DOE for trying to separate student inputs from what the school itself contributes. We shouldn't call schools "good" just because they had high performing kids to begin with.

Unfortunately, their system does not do a good job of identifying schools that spur more or less progress for their kids. Year-to-year growth is incredibly noisy - and principals and teachers know this best, because they see both the kids and the scores. How many times have we been surprised by a student who scored worse than her true level of skill, or better than it?

I agree for sure that the scores of some students are not reflective of what I and their teachers know they can do. I guess I always assumed that we break even (some get lucky and score high, some have a bad day and score low), although I wonder whether, in a community with significant economic disadvantage, the bad days are more frequent. But even that variation wouldn't seem to matter if schools are compared to each other--wouldn't they all have similar ratios of nonreflective scores?

But my last questions weren't meant to be rhetorical. As a principal of a school that got a poor grade, I've thought a lot about what they should measure to accurately grade schools. So many of the great things I see in my school are hard to quantify. Have you seen cities that have better systems?

I am not sure whether this is a fourth explanation or some sort of union and intersection of EW's and SB's explanations: the lower a school's quality measures are, the more any small increase counts toward a progress grade. For example, School #1 had 100% proficient students last year and 100% proficient students this year. The school had a gain of 0%, so School #1 earns an F! Meanwhile, School #2 had 1% proficient students last year and 2% proficient students this year. The second school doubled its percentage of proficient students and therefore earns 100%.
