
Are New York City Schools Shortchanging High Achieving Students? The View from 2003-2008

Savvy New York City parents have long suspected that high achieving kids are losing out in the push to boost the achievement of the lowest performing students. But those suspicions are often cast aside by public officials as helicopter parent whining or muted class warfare.

A review of 4th grade test score data from 2003-2008, however, suggests that these parents have been on to something. Over that period, the fraction of students scoring in the highest achievement level on the 4th grade NY state ELA test plummeted.

In 2003, 15.6% of 4th graders scored at Level 4. By 2008, only 5.8% did. In other words, the fraction of students scoring at Level 4 in 2003 was about 2.7 times this year's. At the same time, the percentage of students scoring at proficiency has increased by about 9 percentage points, from 52.4% to 61.3%.
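For anyone who wants to check the arithmetic, here is a minimal sketch using only the four percentages quoted above. The variable names are mine, chosen for illustration.

```python
# Percentages quoted in the post (4th grade NY state ELA test)
level4_2003, level4_2008 = 15.6, 5.8   # percent scoring at Level 4
prof_2003, prof_2008 = 52.4, 61.3      # percent scoring at proficiency

# How many times larger the 2003 Level 4 share was than the 2008 share
level4_ratio = level4_2003 / level4_2008

# Changes expressed in percentage points, not percent
level4_drop_points = level4_2003 - level4_2008
proficiency_gain_points = prof_2008 - prof_2003

print(round(level4_ratio, 1))             # 2.7
print(round(level4_drop_points, 1))       # 9.8
print(round(proficiency_gain_points, 1))  # 8.9
```

Note that the two changes are nearly the same size in percentage points, but they point in opposite directions.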

Put bluntly, it appears that schools are focusing on pushing lower performing students over the passing mark, and shortchanging high-achieving students in the process. In Bloomberg's New York, as it turns out, a rising tide does not lift all boats.


You can find the data from 1999-2005 here, and the data from 2006-2008 here. I analyzed 4th grade scores because tests weren't given in all of grades 3-5 throughout the entire time period. I would have preferred to work with average scale scores at different parts of the distribution over time (e.g. the 10th and 90th percentiles), for all of the reasons suggested below. If anyone knows where to find these data, please let me know.

How can you have two forms of a test with the same average scaled score, the same cut score, more students proficient or above, but fewer students advanced? If teachers stop teaching the more difficult material/questions in the item bank and focus solely on maintaining performance on the easier items, you get the exact result described.

Imagine a ten item test with a scaled score from 1 to 10. At the end of this post are two mini data sets for those with a statistical bent. Each test has ten questions (positions 1-10) and a student ID (positions 11-12, a two digit number from 1 to 25). The multiple-choice items have already been scored with 0 for wrong and 1 for right. You might notice that the items to the left are more difficult (more 0's) than the items to the right (more 1's). Let's call the two data sets OLD TEST and NEW TEST.
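The record layout just described could be read with something like the sketch below. The two sample records are hypothetical, made up only to show the layout; they are not rows from the OLD TEST or NEW TEST data sets.

```python
def parse_record(line):
    """Split one 12-character record into (student_id, item_scores)."""
    items = [int(ch) for ch in line[:10]]  # positions 1-10: items scored 0/1
    student_id = int(line[10:12])          # positions 11-12: two-digit student ID
    return student_id, items

# Hypothetical records for illustration only (NOT the post's data)
sample_records = ["000101111107", "001111111125"]

# Total score per student = sum of the ten scored items
totals = {}
for record in sample_records:
    student_id, items = parse_record(record)
    totals[student_id] = sum(items)

print(totals)  # {7: 6, 25: 8}
```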

Imagine these two tests were built to be equally difficult based on a bank of items calibrated at the same time. However, you need to verify that equivalence when you administer the new form. You may have developed that bank before the test had stakes attached OR you field tested the items when everyone knew it was just a field test OR these particular items performed differently when you put them together on the same form OR you field tested the items with small, unreliable, non-representative samples of students. Any of these could cause some trouble later. In practice, you might need to adjust the NEW TEST and any subsequent test with an equating constant. If real world use of the test reveals that the items you selected were easier than you had thought, you would need to adjust the test so that students had to get more questions right to pass. If real world use of the test reveals that the items you selected were harder than you had estimated, you would adjust the scale down so that students needed to earn fewer raw score points to pass.

If you enter the data sets below into your favorite software, you can calculate a total score by summing the first 10 columns of each data set. Let's call them OLD TOTAL and NEW TOTAL. In this case, our estimate that the two forms were equally difficult didn't seem to work very well. The OLD TOTAL mean was 5.68, but the NEW TOTAL mean was 4.68, so the new form looks a whole point more difficult. We might wonder why and could think of all kinds of reasons, but in the end we would need to adjust the NEW TEST scale to be certain the test was fair.
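One simple way to express that adjustment is an additive shift equal to the difference in means, which is sometimes called mean equating. The sketch below assumes that approach and uses only the two means quoted above. The function names are hypothetical, and real equating procedures are usually more involved than this.

```python
OLD_MEAN = 5.68  # OLD TOTAL mean from the post
NEW_MEAN = 4.68  # NEW TOTAL mean from the post

# Additive equating constant: how much harder the new form turned out to be
equating_constant = round(OLD_MEAN - NEW_MEAN, 2)

def equate(raw_new_score):
    """Place a raw NEW TEST score on the OLD TEST scale (simple shift)."""
    return raw_new_score + equating_constant

def raw_cut_on_new_form(scale_cut):
    """Raw NEW TEST points needed to reach a cut defined on the OLD scale."""
    return scale_cut - equating_constant

print(equating_constant)       # 1.0
print(equate(3))               # 4.0
print(raw_cut_on_new_form(4))  # 3.0
```

Under this assumption, a student needs one fewer raw point on the harder new form to reach the same cut.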

Finally, imagine our ten point test has two cut scores: one for proficient (4 or higher) and one for advanced (9 or 10). On the OLD TEST, the distribution of scores was:

Score  % of students
1      4
2      8
3      8
4      8
5      4
6      28
7      28
8      4
9      4
10     4

The percent of students proficient or better (4 or more) was 80% and the percent advanced (9 or more) was 8%.

On the NEW TEST, before the adjustment of 1 point for equating, the scores were:

Score  % of students
1      4
2      8
3      12
4      4
5      48
6      16
7      4
8      4

On the NEW TEST, after the equating adjustment, the scores are:

Score  % of students
2      4
3      8
4      12
5      4
6      48
7      16
8      4
9      4

The percent of students proficient or better is now 88% (up 8 percentage points from the OLD TEST) and the percent advanced is now 4% (down 4 points from the OLD TEST). Thus, we have a test with the same total score mean (both are now 5.68), the same cuts (4 and 9), more students proficient (up 8 points), and fewer advanced (down 4 points). Did the students below proficient actually move up based on their own achievement, or did they merely get a statistical bump from the poorer performance of better students on items no longer taught? Who knows?
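The comparison can be reproduced from the score distributions alone. This is a minimal sketch that reads each distribution as score mapped to percent of students; the names `summarize`, `old_dist`, `new_raw`, and `new_equated` are mine.

```python
# Score -> percent of students, as listed in the distributions above
old_dist = {1: 4, 2: 8, 3: 8, 4: 8, 5: 4, 6: 28, 7: 28, 8: 4, 9: 4, 10: 4}
new_raw = {1: 4, 2: 8, 3: 12, 4: 4, 5: 48, 6: 16, 7: 4, 8: 4}

# Apply the 1-point equating adjustment to every NEW TEST score
new_equated = {score + 1: pct for score, pct in new_raw.items()}

def summarize(dist, proficient_cut=4, advanced_cut=9):
    """Return (mean score, % proficient or better, % advanced)."""
    total = sum(dist.values())
    mean = sum(score * pct for score, pct in dist.items()) / total
    proficient = sum(p for s, p in dist.items() if s >= proficient_cut) * 100 / total
    advanced = sum(p for s, p in dist.items() if s >= advanced_cut) * 100 / total
    return round(mean, 2), proficient, advanced

print(summarize(old_dist))     # (5.68, 80.0, 8.0)
print(summarize(new_equated))  # (5.68, 88.0, 4.0)
```

The means match after the shift, yet the share at or above 4 rises and the share at or above 9 falls, which is the pattern described above.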

If you look carefully at the individual items in the data sets, or run them through your software, you will see that both tests had many very easy questions on which both groups of students performed well. Students even seemed to perform equally well on the really difficult questions. But performance on the mid-level difficulty questions, those slightly more difficult than a proficient student could answer correctly, has fallen sharply, so these items appear much more difficult. That is consistent with the practice of teachers drilling the easy level questions, teachers ignoring medium to higher level difficulty questions, and some students getting a lucky guess on the really tough questions. So, did the students who are now proficient really deserve to be, or did they merely get a lucky bounce from the fact that the new form seems more difficult because moderately difficult material is no longer taught?

In the real world, how could you tell? Look at the difference in item difficulties between the old and new form. Did only the medium to very difficult questions seem to be harder and make the new form look harder overall? Was there a difference in the item-total correlations for the medium to very difficult items between field test and real use? If those items are no longer taught, students with higher overall scores might be left to guess. You might actually need to do some kind of curriculum audit to verify what is being taught in the classrooms, if you want to be secure in your interpretation of the test.
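The two checks mentioned here, item difficulties and item-total correlations, can be sketched as follows. The response matrices are hypothetical (5 students, 4 items, ordered hard to easy) and exist only to show the calculation. The item-total correlation shown is the uncorrected version, in which the item itself is included in the total.

```python
from math import sqrt, isnan

def item_difficulty(responses):
    """Proportion of students answering each item correctly."""
    n = len(responses)
    return [sum(row[j] for row in responses) / n for j in range(len(responses[0]))]

def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def item_total_correlations(responses):
    """Correlation of each 0/1 item with the total score (item included)."""
    totals = [sum(row) for row in responses]
    return [pearson([row[j] for row in responses], totals)
            for j in range(len(responses[0]))]

# Hypothetical 0/1 responses, one row per student (NOT the post's data)
old_form = [[0, 1, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
new_form = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]

old_p = item_difficulty(old_form)
new_p = item_difficulty(new_form)

# Positive values mean the item looked harder on the new form
drop = [round(o - n, 2) for o, n in zip(old_p, new_p)]
print(drop)  # [0.0, 0.4, 0.0, 0.0]

print(item_total_correlations(new_form))
```

In this made-up example only the second item, a mid-difficulty one, looks harder on the new form, which is the kind of pattern the comment suggests looking for. An item that everyone answers correctly has no variance, so its correlation is reported as NaN.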

In the past two weeks, we have had significant research reports indicating that smarter students are being left behind, but proficiency rates are up overall. The NY situation presents a seemingly perplexing trend: same mean, same cut score, more proficient, fewer advanced. But such strange results are entirely consistent with a maniacal focus on proficiency at the expense of all else and, as the data here suggest, may be exactly what we should expect.



I noticed something strange about the 2007 figures. They don't add up properly.

The 2006-2008 table gives, for each grade level, and for each year from 2006-2008, the total number of test takers, and then the total number of level 1, 2, 3, and 4 scores (as well as the number of 3+4 scores).

The numbers in each row should add up to the total. For instance, if the total number of third-grade test takers in a given year was x, then the numbers of students scoring 1, 2, 3, and 4 in that year would add up to x. There is no other score, so how could it be otherwise?

Everything adds up except for the 2007 figures, which do not add up for any grade level. This is what I mean:

In 2007, 71045 third graders were tested.
9249 scored at level 1.
21735 scored at level 2.
35695 scored at level 3.
4375 scored at level 4.

Now add up 9249+21735+35695+4375. You get 71054, not 71045.

Do this for any of the grade levels for 2007, and you will see a discrepancy. Try it for 2006 or 2008, any grade level, and everything will add up just fine.
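The same check is easy to script for any row of the table. In this sketch only the 2007 third-grade figures come from the numbers above; the function name is mine.

```python
def row_discrepancy(total_tested, level_counts):
    """Sum of the Level 1-4 counts minus the reported total (0 if consistent)."""
    return sum(level_counts) - total_tested

# 2007 third-grade figures quoted above: Levels 1, 2, 3, 4
grade3_2007_levels = [9249, 21735, 35695, 4375]
grade3_2007_total = 71045

print(sum(grade3_2007_levels))                                 # 71054
print(row_discrepancy(grade3_2007_total, grade3_2007_levels))  # 9
```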

What is going on with these 2007 figures?

Could we get a brief primer sometime on test score terminology? I know the basic difference between "standards-based" and "norm-referenced," but I'd like to have a clearer understanding of terms like "scale score" and "cut score."

What I'm curious about is this: if the test scores are so wrong, what's going to happen with the middle school and high school application processes that rely, in some cases, almost exclusively on 4th grade and 7th grade test scores?

For middle schools, many of the best ones in District 2 say if your child doesn't have 4s in both ELA and Math you probably won't get in. However, now with fewer students getting the coveted 4s, will this open up the process or will they start relying on their own tests more? In other words, are our kids going to have to take more tests?

The middle school application process this year had a slew of its own problems. I'm just really wondering what is going to happen next year. Of course no one will ever know because no one will ever tell us.

Eduwonkette, I know nothing about statistics and marvel at your expertise with all this, so with trepidation I would like to ask a question in any case.

If the percentage of high scores has gone down over the years, do you think it's attributable to some educationally savvy parents pulling their kids out of what they see as a dysfunctional public system and putting them into private schools when they can afford it?

Obviously, small classes and more individualized attention make for better learners. So do well-supplied and well-maintained buildings, good lunches, musical instruments, textbooks, and libraries.

How much of our NYC school population is running away from NYC schools?

Personal anecdote: I taught a few years ago in a California public school. We (teachers) were specifically instructed in staff meetings by the principal to target the "cusp" students: those who were on the edge of passing from basic to proficient on the state proficiency tests.

We were instructed in the staff meetings to circle the students on our class rosters with "cusp" test scores. We were then expected to focus more strongly on these students, and come up with instructional strategies to help them get over the bubble into the next "category."

This is an unintended consequence of basing school AYP scores (and their consequences) on the students' score categories rather than on the average scores of the students. Principals will always be tempted to focus more on ways to get the "basic" kids up to "proficient"; we were basically told to ignore the "below basic" kids b/c they were not likely to make proficiency anyway. I think this is a very odd and unfortunate way to teach.

Of course, teachers who complained about this or resisted were "insubordinate." Another reason leading to my decision to leave teaching to pursue other career alternatives...

...those suspicions are often cast aside by public officials as helicopter parent whining or muted class warfare

This is the first time I have seen the expression "helicopter parenting" defined as a term of opprobrium used by administrators and educators to criticize parents and evade accountability.

I'm grateful.

I'd like to raise a question concerning NY tests administered in 6th, 7th, and 8th grades.

We've discovered that there is practically no range of scores for the 4. In other words, the cut score between the 3 and the 4 has been set high. (I believe that's the correct way to put it -- ?)

When our son reached middle school, he sank from all 4s to all 3s on the state tests (finally managed a 4 on the 8th grade math, but that's it), apparently experiencing a middle grades slump.

However, his raw scores were in the 90s on a scale of 1 to 100.

To illustrate, in one year he received subscores of 96, 93, and 90 on the ELA exam, each of which was "Above the Target Range." And yet his scaled score was 15 points below the cut-off for a 4.

Two years ago he scored at the 95th percentile in reading on the ITBS; this year he was at the 97th percentile on the ISEE.

And yet he's 15 points below the cut-off for a 4 on the state tests.

What is going on here?

I haven't taken the time to look at the 4th grade tests, so I don't know whether the situation is similar there.


