Come on Feel the Noise!
Last week, skoolboy and I took to our computers with the newly released data. Of particular concern is the progress measure, which makes up 60% of a school’s grade. Both skoolboy and Dan Koretz have already identified serious flaws in DOE’s test progress model. Even in the absence of these problems, we know that all models of year-to-year growth must contend with measurement error present in two different tests.
What the heck is measurement error? Bear with us for two paragraphs, because this is critical to understanding the central problem with the Progress Reports. A test score is just a proxy for students' underlying skills and competencies. If you give a student a test, the test score represents the combination of her "true" level of skills plus measurement error. This error may be a function of idiosyncratic factors like not eating breakfast (which might hurt her score), having the good fortune of having studied the material that happens to be on the test (which would raise her score above her true level of skill), or a dog barking during the test (which might decrease the scores of all students in a classroom). A "gain score" represents the difference between two test scores, both of which are measured with error, so it provides a noisy estimate of a student's actual growth.
If measurement error were constant, it would simply cancel out when we take the difference between the two scores. But we know that measurement error is likely to be random, so the two errors do not cancel out. Another kind of error stems from sampling variation, which I have discussed here before. In short, the more measurement error (or “noise”) in the results, the harder it is to detect the “signal” that represents a school’s actual contribution to growth in student learning.
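For readers who like symbols, here is a minimal sketch of the standard true-score model behind those two paragraphs. The notation is ours, and the last line assumes the two errors are independent with the same variance, which is a simplification.

```latex
% Illustrative notation: X_t = observed score in year t,
% T_t = true skill in year t, e_t = measurement error in year t
X_t = T_t + e_t
\qquad
G = X_2 - X_1 = (T_2 - T_1) + (e_2 - e_1)

% If the error were the same constant c in both years, it would cancel:
e_1 = e_2 = c \;\Rightarrow\; G = T_2 - T_1

% If the errors are independent with common variance \sigma_e^2 (assumption),
% they do not cancel; their variances add:
\operatorname{Var}(e_2 - e_1) = 2\sigma_e^2
```

Under that assumption the gain score carries twice the error variance of a single test score, while the true gain it is trying to capture is usually much smaller than the true level.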
In what follows, we demonstrate that there is almost no relationship between NYC schools' progress scores in 2007 and 2008. The progress measure, it appears, is a fruitless exercise in measuring error rather than the value that schools themselves add to students. If we believe that the Progress Reports are in the business of cleanly identifying schools that consistently produce more or less progress, this finding is rather troublesome.
First, some sunnier results: Below, we provide scatterplots of the relationship between the overall environment and performance-level scores in 2007 and 2008 for the 566 elementary schools that received overall grades in both years. In both cases, last year’s score is a strong predictor of this year’s score. To quantify the extent to which two variables move together, we can make use of a measure called a correlation coefficient. A correlation of 0 implies that the variables have no linear relationship, while a correlation of 1 represents a perfect positive relationship. We find that the correlation is .82 for the performance score and .75 for the environment score. This is exactly what we would expect – schools’ performance and climates do not change wildly from year to year.

But the relationship between the 2007 and 2008 progress scores is quite different – the correlation is -.02. In other words, there is almost no relationship! This is precisely what we would expect to see if the growth measures were primarily capturing measurement error. (These correlations are still low, but slightly larger, for K-8 and middle schools - the correlations were .11 and .15, respectively.)
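A quick simulation illustrates the pattern. This is a sketch under made-up assumptions, not a reconstruction of the DOE's model: every school gets a stable true level and a stable true rate of growth, each year's observed score adds fresh random error, and all of the numbers (the standard deviations, the seed) are hypothetical. Only the count of 566 schools echoes the data above.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_schools=566, noise_sd=1.0, seed=0):
    """Return (level correlation, gain correlation) for two report years.

    Every parameter value here is a hypothetical choice for illustration.
    """
    rng = random.Random(seed)
    # Each school has a stable true level and a stable true yearly growth.
    level = [rng.gauss(0.0, 3.0) for _ in range(n_schools)]
    growth = [rng.gauss(0.0, 0.2) for _ in range(n_schools)]

    def observed(year):
        # True score in that year plus independent measurement error.
        return [l + year * g + rng.gauss(0.0, noise_sd)
                for l, g in zip(level, growth)]

    y0, y1, y2 = observed(0), observed(1), observed(2)
    gain_1 = [b - a for a, b in zip(y0, y1)]  # "progress" in the first report
    gain_2 = [b - a for a, b in zip(y1, y2)]  # "progress" in the second report
    return pearson(y1, y2), pearson(gain_1, gain_2)


if __name__ == "__main__":
    for sd in (0.0, 1.0):
        level_r, gain_r = simulate(noise_sd=sd)
        print(f"noise_sd={sd}: level r = {level_r:.2f}, gain r = {gain_r:.2f}")
```

With the noise switched off, the simulated gains line up almost perfectly from one year to the next. Once the error is large relative to real differences in growth, the level scores stay strongly correlated while the correlation between the two years' gains collapses, even though every simulated school's true growth never changed.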

We are left with three possible explanations:
1) The poorly constructed progress measure is simply measuring noise.
2) The DOE somewhat tweaked the progress measure for this year, so the results are not comparable.
3) The receipt of and publicity around last year’s progress measures fundamentally changed how New York City’s elementary schools do business, so that schools that were more successful in raising student achievement in 2007 suddenly became less so, and schools that were less successful in raising student achievement in 2007 suddenly became more so.
New Yorkers are left with three courses of action:
* If explanation 1 is correct, we should ignore these report cards altogether because they are primarily (60%) measuring error.
* If explanation 2 is correct, we should not compare schools' grades in 2007 with their grades in 2008, because they are measuring fundamentally different dimensions of school performance. In this case, the collective hysteria that ensued in NYC schools last week about why grades are up or down is all for naught.
* And if explanation 3 is correct, eduwonkette and skoolboy should shut up and get out of the way of the silent revolution that has transformed public schooling in New York City.
Thanks to skoolboy’s masterful analysis of the data, we present evidence below the fold to suggest that the likely culprit is measurement error. The evidence is not conclusive, because every single element of the progress measure (and there are 16 of them in this year’s student progress measure) changed slightly from last year to this year. The strategy that we pursue below is to compare those elements of the progress measure that were used in both years, for example the percentage of students making at least one year of progress, or the average change in proficiency scores. Again, we stress that these measures were not identical across years, but one would expect them to be moderately related. Needless to say, that is not what we found. We think it extremely unlikely, given the analyses described in detail below, that this is simply due to a tweaking of the progress report measures.
And what of the third explanation—a fundamental overhaul in the effectiveness of New York City’s elementary and middle schools over the past year that reshuffled the effective and ineffective schools? Magical transformations that shift schools from low to high-progress, or vice versa, are the fabled stuff of Hollywood movies, not reality. Real school change, unfortunately, is not an overnight affair.
Where does this leave NYC parents, teachers, and principals, all of whom are trying to make sense of what these measures mean? Bottom line: It's impossible to know what your A or your F means, because these grades are dominated by random error. Let's hope that the DOE heads back to the drawing board rather than continuing to defend the indefensible.
A key measure in both last year’s and this year’s student progress measure is the percentage of students making at least one year of progress in ELA and in Math, where a year of progress is defined as attaining the same or higher proficiency rating in 2008 in the subject as the student received in 2007, with a minimum proficiency rating of 2.00 in 2008. Three changes to this are new this year: (a) if a student scored at Level IV in both 2007 and 2008, that student is counted as making one year of progress, even if the proficiency rating declined from 2007 to 2008; (b) all students who were designated Special Education in 2007 receive a +0.2 addition to their 2007 proficiency rating before calculating whether a year of progress was achieved; and (c) any middle school student earning an 85 or higher on the Math A or Integrated Algebra Regents exam is automatically classified as making one year of progress in Math.
For elementary schools, the correlation between the peer horizon score for the percentage of students making at least one year of progress in ELA in 2007 and in 2008 is -.10, and the correlation for the citywide horizon score over the two years is -.09. There is essentially no stability over time in which elementary schools were successful in advancing their students a year in ELA achievement. The story is even more surprising at the K-8 and middle school levels; the K-8 peer horizon correlation is -.15, and citywide horizon correlation is -.16, whereas the middle school peer horizon correlation is -.24, and citywide horizon correlation is .01.
The stability in a school’s ability to advance its students a year of progress in Math in 2007 and 2008 is a bit higher, especially at the middle school level. For elementary schools, the correlation of the peer horizon score in 2007 and 2008 is .09, and for the citywide horizon score it’s .16. Among K-8 schools, the peer horizon score correlates -.03, and the citywide horizon score correlates .11. The greatest stability is seen at the middle school, where the over-time correlation for the Math peer horizon score is .33, and for the citywide horizon score is .32.
We did the same kind of over-time calculation for the average change in proficiency scores from 2007 to 2008, which also involved the Special Education adjustment in 2008. Five of the six correlations for the average change in ELA proficiency, which range from -.16 to -.37, are negative and statistically significant. What this means is that the schools that were judged to be more effective in raising students’ ELA proficiency in the 2007 report card were significantly less successful in producing ELA gains in 2008 than the schools that were less effective in 2007.
At best, there is no correlation over time in the DOE’s reports of which schools are good at inducing growth in ELA achievement. At worst, the DOE’s system finds that the schools that were better than average in 2007 were actually worse than average in 2008.
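One mechanism that could produce negative correlations like these is worth spelling out. As a sketch, assume (this is our assumption, not something stated in the DOE's documents) that last year's progress measure compared 2006 and 2007 scores, that this year's compares 2007 and 2008 scores, and that each year's measurement error is independent of the others and of true growth, with variance σ_e². The 2007 score then enters the two gains with opposite signs:

```latex
% Illustrative notation: X_t = T_t + e_t, \Delta T_t = T_t - T_{t-1}
G_{07} = X_{07} - X_{06} = \Delta T_{07} + (e_{07} - e_{06})
G_{08} = X_{08} - X_{07} = \Delta T_{08} + (e_{08} - e_{07})

% The shared error e_{07} appears as +e_{07} in one gain and -e_{07} in the other:
\operatorname{Cov}(G_{07}, G_{08}) = \operatorname{Cov}(\Delta T_{07}, \Delta T_{08}) - \sigma_e^2
```

A school that had an unusually lucky 2007 would look like a big gainer in the first report and a big loser in the second. Under these assumptions, whenever the error variance exceeds the covariance of the true gains, the over-time correlation of the gain scores is negative.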


Comments
Huh!
Posted by: Cleo | September 22, 2008 1:20 PM
Really great reasoning: It can't be reason 3 because, uh, schools can't improve. Which we'll show you by telling you this accountability system doesn't work. We know this because, uh, it just can't!
Maybe they'll make a movie about the NYC DOE one day.
Posted by: Socrates | September 22, 2008 2:03 PM
Hi Socrates,
Yes, the movie will be called "The Smartest Guys in the Room, the Sequel."
If schools did improve so dramatically as we posit in explanation 3, we should also see a very weak correlation on the overall performance measure. We don't. The correlation between the 2007 and 2008 performance score is .82.
Posted by: eduwonkette | September 22, 2008 2:46 PM
Imagine what we could do if the DOE spent all that time, energy and MONEY on working in smart ways (by trimming the fat, listening to teachers and thinking of ways to be honest about the progress and measures we use) rather than thinking of ways to defend itself. Because their idea sucks.
Can you imagine if a lowly employee of the DOE (e.g. a teacher....gasp!) spent all that time defending her poorly administered assessments? Or, if they had their way, we spent ALL our time assessing, making graphs and defending said graphs? No, you can't because then NOTHING WOULD GET DONE.
Hmmmm....maybe the guys at the top could learn something from those of us at the bottom.
Posted by: Mimi | September 22, 2008 4:44 PM
You state:
"What this means is that the schools that were judged to be more effective in raising students’ ELA proficiency in the 2007 report card were significantly less successful in producing ELA gains in 2008 than the schools that were less effective in 2007."
Could this be due to (a) a ceiling effect or (b) the psychometric properties of the test?
If there is a ceiling effect, then the high performers will be constrained in the second year. I don't know enough about the test to know whether this is possibly the case.
I have seen some tests that are constructed in such a way that getting one additional question correct at the lower end of the distribution of scores translates into a greater scale score gain than getting one additional question correct at the high end of the distribution. This could cause some funny things to happen as well.
QUESTION: In your graph, what are the progress category scores? An increase or decrease in proficiency levels? Maybe you described this, but I missed it. Please explain.
Posted by: Ed Fuller | September 22, 2008 5:57 PM
We don't think this is due to a ceiling effect, because the DOE explicitly tried to adjust for possible ceiling effects by defining any student who was at Level IV in both 2007 and 2008 as having made a year of progress, even if that student's scale score in 2008 was lower than his/her scale score in 2007.
Dan Koretz' and my posts last week speak to the flaws in the DOE's process of converting scale scores to proficiency scores, but I don't think that the conversion of raw scores to scale scores is the culprit here. It's true that very high and very low scale scores are less reliable than scores in the middle of the distribution.
It's hard to explain the calculation of the progress scores succinctly. Here goes: They are a weighted average of the relative position of a school's progress in relation to the minimum and maximum progress observed in a group of 40 similar peer schools at the same grade level, and the relative position of a school's progress in relation to the minimum and maximum progress observed among schools at the same grade level citywide. A school whose student progress was higher than all of the 40 schools in its peer group would receive a progress score for that measure of about 1; a school whose student progress was lower than all of the 40 schools in its peer group would receive a progress score for that measure of about 0.
If that sounds confusing, it should be.
Posted by: skoolboy | September 22, 2008 6:14 PM
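skoolboy's description of the progress score can be sketched in a few lines of Python. This is a rough illustration of the comment above, not the DOE's actual formula: the function names are ours, `peer_weight` is a hypothetical parameter because the comment does not say how the two positions are weighted, and we assume each comparison group includes the school's own value so that the result stays between 0 and 1.

```python
def relative_position(value, group_values):
    """Position of `value` between the group's minimum (0) and maximum (1)."""
    low, high = min(group_values), max(group_values)
    if high == low:
        raise ValueError("group has no spread; relative position is undefined")
    return (value - low) / (high - low)


def progress_score(value, peer_values, city_values, peer_weight):
    """Weighted average of the peer-group and citywide relative positions.

    `peer_weight` (between 0 and 1) is a hypothetical parameter; the actual
    weights are not given in the description this sketch follows.
    """
    if not 0.0 <= peer_weight <= 1.0:
        raise ValueError("peer_weight must be between 0 and 1")
    peer = relative_position(value, peer_values)
    city = relative_position(value, city_values)
    return peer_weight * peer + (1.0 - peer_weight) * city


if __name__ == "__main__":
    # Made-up numbers: percentage of students making a year of progress.
    peers = [40.0, 50.0, 55.0, 60.0]
    city = [20.0, 50.0, 55.0, 80.0]
    print(progress_score(55.0, peers, city, peer_weight=0.5))
```

Because each position depends on the single lowest and single highest school in the group, one outlier school can stretch or squeeze the scale for everyone else, which is one more way for noise to enter the score.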
Eduwonkette and Skoolboy--
You rock! If there were public service awards in public education, I would nominate you both. Eduwonkette: you have earned your cape....Skoolboy: go to the head of the class.
Seriously folks, we have to keep the pressure on the DOE to live up to their commitment to data-driven decision making. First, they need better data than the flawed State tests; second, they need better data analysis systems.
Question: when is someone going to sue NYS for tests that are not reliable or valid?
Posted by: Citizen X | September 22, 2008 8:59 PM
Interesting analysis. I'm not sure I understand why it measures error--crunching numbers is not my forte. But I will say from personal experience (as a principal) that the progress reports did put a lot of pressure on the lowest schools and at least at our school, we targeted specific areas that changed our score pretty dramatically.
I'm not defending the progress reports--I think they paint a very limited picture of a school, especially in the lower grades. But I don't think the DOE is wrong in its efforts to try to develop tools to quantify what schools are doing. Certainly, I know there are LOTS of schools that have run for years with a good reputation but that haven't actually moved kids very much at all. How do you unroot that?
Do you have ideas about better ways to measure academic progress in a way that rewards schools that don't just cherry-pick the "best" kids and then congratulate themselves on high test scores?
Posted by: janie | September 23, 2008 9:26 AM
Hi Janie,
I definitely have no problem with quantification, and I also applaud the DOE for trying to separate student inputs from what the school itself contributes. We shouldn't call schools "good" just because they had high performing kids to begin with.
Unfortunately, their system does not do a good job of identifying schools that spur more or less progress for their kids. Year-to-year growth is incredibly noisy - and principals and teachers know this best, because they see both the kids and the scores. How many times have we been surprised by a student that did worse than their true level of skill, or better than their true level?
Posted by: eduwonkette | September 24, 2008 10:23 AM
Eduwonkette,
I agree for sure that the scores of some students are not reflective of what I and their teachers know they can do. I guess I always assumed that we break even (some get lucky and score high, some have a bad day and score low), although I wonder if, in a community with significant economic disadvantage, the bad days are more frequent. But even that variation wouldn't seem to matter if schools are compared to each other--wouldn't they all have similar ratios of nonreflective scores?
But my last questions weren't meant to be rhetorical. As a principal of a school that got a poor grade, I've thought a lot about what they should measure to accurately grade schools. So many of the great things I see in my school are hard to quantify. Have you seen cities that have better systems?
Posted by: janie | September 24, 2008 1:27 PM
I am not sure whether this is a fourth explanation or some sort of union and intersect of the EW's and SB's explanations: The lower the school's quality measures are, the greater any small increase counts toward a progress grade. For example, School #1 had 100% proficient students last year and 100% proficient students this year. The school had a gain of 0%. School #1 earns an F! Meanwhile, School #2 had 1% proficient students last year and 2% proficient students this year. The second school experienced a doubled percentage of proficient students and therefore earns 100%.
Posted by: Kew Gardener | September 24, 2008 10:53 PM