Through the lens of social science, eduwonkette takes a serious, if sometimes irreverent, look at some of the most contentious education policy debates. (Find eduwonkette's complete archives prior to Jan. 6, 2008 here.)

October 1, 2008

Why skoolboy Is Uncertain about the NYC School Progress Reports

It’s election season, which means that we’re being inundated with polls. The reporting of poll results drives statisticians nuts, because the press often reports the percentage of those surveyed who favor one candidate or another, without taking into account the poll’s margin of error. The margin of error is a way of quantifying the uncertainty in the poll numbers, because even a well-designed poll that surveys a random and representative sample of the population is going to generate an estimate of the true proportion of those in the population who favor a particular candidate. The general rule of thumb is, the more information available in a sample, the less uncertainty in the estimate. A smaller batch of information will yield a more uncertain, or imprecise, estimate than a larger batch of information. This is as true for estimates of the relative performance of schools and teachers—whether in the form of a complex value-added assessment model or a simple percentage—as it is for political polls.
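To put rough numbers on that rule of thumb, here is a minimal sketch using the standard normal-approximation formula for the margin of error of a sample proportion. The sample sizes and the 50% split are hypothetical, chosen only for illustration, and the function name is my own.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion
    (normal approximation; assumes a simple random sample)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical polls in which half the respondents favor a candidate.
for n in (100, 400, 1600):
    moe_points = 100 * margin_of_error(0.5, n)
    print(f"n={n}: about +/- {moe_points:.2f} percentage points")
```

Under these assumptions, quadrupling the sample size cuts the margin of error in half, which is the sense in which a larger batch of information yields a more precise estimate.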

With apologies to anyone who’s had an introductory statistics course, suppose that we were trying to estimate the average age of the teachers in a very small school (one with only four teachers), but we can only draw a sample of three of the teachers to estimate that average. The four teachers are 25, 30, 30, and 55 years old, and the true average age is (25+30+30+55)/4=35. If our sample was the teachers who are 25, 30 and 30, our estimate of the average age of teachers in the school would be (25+30+30)/3=28.33. If our sample was the teachers who are 30, 30 and 55, our estimate of the average would be (30+30+55)/3=38.33. It’s a simple example, but it shows that different samples drawn from a given population can produce quite different estimates, which can be some distance away from the true population value. You wouldn’t want to place too much confidence in a particular estimate if you knew that another, equally valid sample of the same size could generate an estimate that was quite different.
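The four-teacher example is small enough to enumerate every possible sample of three. The sketch below does that with Python's standard library; the variable names are mine, and the ages are the ones given above.

```python
from itertools import combinations
from statistics import mean

ages = [25, 30, 30, 55]
true_mean = mean(ages)  # 35

# combinations() picks by position, so the two 30-year-olds
# are treated as two different teachers.
samples = list(combinations(ages, 3))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, round(m, 2))

print("lowest estimate:", round(min(sample_means), 2))
print("highest estimate:", round(max(sample_means), 2))
print("average of all estimates:", round(mean(sample_means), 2))
```

There are only four possible samples. The estimates run from about 28.3 to about 38.3, none of them hits the true value of 35 exactly, and yet they average out to 35 across all possible samples.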

That same logic applies to estimates of school and teacher performance, such as the New York City School Progress Reports. Most of the elements of the Progress Reports are estimates (for an explanation why, see here), but the calculation of the overall letter grades, which receive so much attention, does not take the uncertainty in these estimates into account. Today, I’ll show that using the 2008 School Progress Reports.

One of the indicators of student progress on the School Progress Reports is the percentage of students who made a year’s worth of progress in English (ELA) and in math from 2007 to 2008. In a given school, each child who was tested in both years can be classified as having made a year’s worth of progress or not, and by totaling up those students who made a year’s worth of progress and dividing by the number of students who were tested in both years, a percentage can be calculated. (There’s an additional wrinkle for students who transferred from one school to another, but it doesn’t affect the logic I’m writing about.)

Each school is compared to a group of 40 peer schools that are judged to be similar based on their demographic and other characteristics. A school’s percentage of children making a year’s progress in ELA is compared to the highest and lowest values in its peer group, and the school gets a peer horizon score that represents its location between the high and low peer group values. For example, if a school had 55% of its students make a year’s progress in ELA, and the percentage for the lowest school in its peer group was 47%, and the percentage for the highest school in its peer group was 71%, the school was located one-third of the way between the lowest and highest schools (8 percentage points above the minimum, out of a possible 24 percentage points above the minimum in the peer group.) That peer horizon score of .33 would be multiplied by the 5.625 points that this component is counted in the calculation of the overall letter grade of the school, yielding a net contribution of 1.875 to the school’s overall score.
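The arithmetic in that example can be sketched in a few lines. The function name and the `WEIGHT` constant are my own labels; the formula is simply the one described in the paragraph above, not code from the DOE.

```python
def peer_horizon_score(school_pct, peer_min_pct, peer_max_pct):
    """Location of a school between the lowest and highest
    values in its peer group, as a fraction from 0 to 1."""
    return (school_pct - peer_min_pct) / (peer_max_pct - peer_min_pct)

WEIGHT = 5.625  # points this component carries in the overall score

score = peer_horizon_score(55, 47, 71)   # 8 / 24, i.e. one-third
points = score * WEIGHT

print(round(score, 3), round(points, 3))
```

Note that the score depends on only three numbers: the school's own percentage and the two most extreme percentages in its peer group.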

The problem is that this calculation doesn’t take into account the fact that all of these percentages are estimates. The chart below looks at one elementary school in particular, Senator John Calandra School (08X014), and compares it to its peer group of 40 schools. At Calandra, 58.3% of the students made a year’s worth of progress in English in 2008. But the standard error of that percentage is 3.5%, which means that, allowing two standard errors on either side of the estimate, Calandra's true percentage could plausibly be anywhere from 51.3% to 65.3%, a wide range. (This range is shown in the “error bars” above and below the estimated percentage for each school.) The same is true for most of the other schools in the peer group. In fact, only two of the 40 schools in the peer group (the ones with the blue markers in the chart) have a percentage that we are confident is higher than Calandra’s percentage. For the other 38 schools in the peer group, we can’t rule out the possibility that Calandra’s percentage is equal to the estimated percentage in those schools. There’s a tremendous amount of overlap among these schools.
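Here is a minimal sketch of that range, together with one simple way to ask whether two schools can be told apart: do their ranges overlap? Calandra's figures are the ones quoted above. The two peer schools are hypothetical, and the overlap check is an illustrative assumption rather than necessarily the comparison behind the chart.

```python
def interval(pct, se, k=2):
    """Range of k standard errors on either side of an estimate."""
    return (pct - k * se, pct + k * se)

def overlaps(a, b):
    """True if two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

calandra = interval(58.3, 3.5)   # roughly (51.3, 65.3)

# Hypothetical peer schools, for illustration only.
peer_close = interval(66.0, 3.5)  # roughly (59.0, 73.0)
peer_far = interval(75.0, 3.0)    # roughly (69.0, 81.0)

print(overlaps(calandra, peer_close))  # ranges overlap
print(overlaps(calandra, peer_far))    # ranges do not overlap
```

Comparing ranges this way is a rough screen. A formal test of the difference between two percentages would combine the two standard errors, but the overlap check is enough to show why most of the peer schools are hard to separate.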

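[Chart 08X014.JPG: percentage of students making a year’s worth of progress in English, with error bars, for Calandra and its 40 peer schools]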

And yet Calandra received a peer horizon score of .463, and other schools in the peer group whose percentages of students making a year’s worth of progress in English did not differ statistically from Calandra received peer horizon scores ranging from .169 to .903. Calandra’s peer horizon score of .463 counted for 2.6 out of a possible 5.625 points toward the overall score on the School Progress Report. Other peer schools whose percentages did not differ significantly from Calandra’s received from 1.0 to 5.1 points out of a possible 5.625 points on this component of the overall score. Differences of this magnitude could easily make the difference between an overall grade of A and of B, or of B and of C—just due to chance. An accountability system such as the New York City School Progress Reports that doesn’t acknowledge the importance of chance and uncertainty is fundamentally misleading the public about its ability to distinguish the relative performance of schools. Some schools are likely doing significantly better than other schools; the problem is that the School Progress Reports don't provide enough information to judge which ones.

September 30, 2008

No Child Left Behind: Looking Back, Looking Forward

I'm knee deep in old NCLB documents, and ran across the Department of Education's NCLB song. NCLB represented not only a major shift in federal education policy, but an embrace of policy/PR boosterism that's enough to make all of us giggle (Remember Armstrong Williams?). Back from 2002, here are the NCLB lyrics:

We're here to thank our president,
For signing this great bill,
That's right! Yeah,
Research shows we know the way,
It's time we showed the will!

No matter how catchy the ditty, a song can't carry a fundamentally flawed law. That's where Tom Toch and Doug Harris come in. They've penned a thoughtful commentary in this week's Ed Week about the future of NCLB (Salvaging Accountability). It's an important one, because it recognizes that NCLB conflates the school's contribution to student learning with what students bring to the school to begin with. Essentially the argument is that:

1) "It’s critical in any accountability system that the metrics used to judge performance reflect accurately the contributions of those being judged."

2) "As a measure of school performance, however, [the NCLB] snapshot strategy is flawed. Because student populations vary greatly from school to school, and because family income, parental education, and a host of other non-school-related factors have a major influence on students’ learning, some schools have to improve student achievement a lot more than others to get their students up to state standards. The federal law is unforgiving of such schools. As a result, it gives an unfair advantage to schools with students from privileged backgrounds, and it fails to measure what matters most: how much students learn during the school year."

3) The Department of Education's Growth Model Pilot offers little improvement over the current rating system because it relies on a projection model - i.e. are students on target to be proficient in a 3 year window? - rather than a true growth model.

4) The new NCLB should dump the projection model, and focus its sanctions on schools that are both low in terms of their growth, and low in terms of their proficiency. And there's no reason to wait for reauthorization - this could all happen via regulations.

No commentary can do it all, so here are some issues to ponder for their next round. The goal of Toch and Harris' proposed system is to make measurement of school performance a more fair and effective enterprise. Why not take the leap and dump 100% proficiency altogether? That way, we could narrowly tailor our sanctions to schools that are low-performing compared to the schools we already have.

And if we're going to go full throttle on value-added models, we can't just punt the measurement problems. For example, Toch and Harris write, "value-added calculations have larger margins of error than NCLB’s proficiency ratings, but because they measure what’s most important in judging schools—student learning gains—their statistical shortcomings are more than worth tolerating."

A poorly designed growth model is no better than the poorly designed proficiency model that we have now, and no one knows this better than New Yorkers. Value-added systems that have literally no relationship between two years' value-added measures are still bad public policy. In short, beware the silver bullet.

September 24, 2008

Could a Monkey Do a Better Job of Predicting Which Schools Show Student Progress in English Skills than the New York City Department of Education?

eduwonkette and I have been blogging about the School Progress Reports released last week by the New York City Department of Education. We’ve shown that, although the performance and environment scores of schools were pretty consistent from last year to this year, the student progress scores were virtually unrelated—knowing a school’s progress score from last year didn’t predict which schools would demonstrate a lot of progress this year. This, we argued, demonstrated that the progress part of the School Progress Report—representing 60% of the letter grade each school received—wasn’t really telling us which schools consistently are promoting student progress, but rather was mostly random error.

The problem was particularly acute in the domain of English Language Arts (ELA). The stability in the student progress scores from 2007 to 2008 was so low that it led skoolboy to wonder if a monkey could actually do a better job predicting which schools show progress in students’ ELA performance in 2008 than relying on the DOE’s 2007 student progress score. The particular measure I examined was the percentage of students in the school making at least one year of progress on the ELA test from last year to this year. (As we've noted in earlier posts, the calculation of this measure changed slightly from 2007 to 2008.)

In the interest of full disclosure, skoolboy didn’t actually rent a monkey to pick the schools. Animals scare him, and he wouldn’t have been able to record the picks while hiding under his bed. What I did instead was use a random number generator to assign each school to the top or bottom half of the distribution of schools on last year’s peer and citywide measures of the percentage of students making a year of progress in English Language Arts.

The DOE got credit for a correct prediction if it correctly predicted that a school would be in the top half of this year’s schools, based on the school being in the top half on the DOE’s 2007 measure, or correctly predicted that a school would be in the bottom half of this year’s schools, based on the school being in the bottom half last year. The monkey got credit for a correct prediction if the randomly-selected location of a school as being in the top half of the 2007 distribution correctly predicted that a school would be in the top half of this year’s schools, or the random pick of being in the bottom half of last year’s distribution correctly predicted that a school would be in the bottom half of this year’s schools. These predictions were done separately for the 570 elementary schools, 128 K-8 schools, and 289 middle schools which received overall letter grades last year and this year.
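For readers who want to try the monkey's side of the contest, here is a minimal simulation sketch. It does not use the actual school data. It assumes a coin flip for each school and a hypothetical outcome in which exactly half of the 570 elementary schools land in the top half this year; the function name and the seed are mine.

```python
import random

def monkey_accuracy(actual_top_half, rng):
    """Share of schools whose random top/bottom pick matches
    the half they actually landed in this year."""
    picks = [rng.random() < 0.5 for _ in actual_top_half]
    hits = sum(pick == actual for pick, actual in zip(picks, actual_top_half))
    return hits / len(actual_top_half)

rng = random.Random(2008)  # fixed seed so the run is repeatable
n_schools = 570

# Hypothetical outcome: half the schools are in the top half this year.
actual = [i < n_schools // 2 for i in range(n_schools)]
rng.shuffle(actual)

accuracies = [monkey_accuracy(actual, rng) for _ in range(1000)]
average = sum(accuracies) / len(accuracies)

print(round(average, 3), round(min(accuracies), 3), round(max(accuracies), 3))
```

Over many repetitions the monkey averages very close to 50% correct, with individual runs a few points above or below. That is the benchmark any real predictor has to beat.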

Round 1. We begin with the peer horizon score for the 570 elementary schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 46% of the elementary schools this year. The monkey correctly predicted the status of 51% of this year’s schools.

Score: Monkey 1, DOE 0.

Round 2. We next turn to the citywide horizon score for the 570 elementary schools. The DOE’s citywide horizon progress score from last year correctly predicted the progress status of 47% of the elementary schools this year. The monkey correctly predicted the status of 52% of this year’s schools.

Score: Monkey 2, DOE 0.

Round 3. In this round, we examine the peer horizon scores for the 128 K-8 schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 45% of the K-8 schools this year. The monkey correctly predicted the status of 55% of this year’s schools.

Score: Monkey 3, DOE 0.

Round 4. Next, we look at the citywide horizon progress scores for the 128 K-8 schools. The DOE’s citywide horizon progress score from last year correctly predicted the progress status of 43% of the K-8 schools this year. The monkey correctly predicted the status of 47% of this year’s schools.

Score: Monkey 4, DOE 0.

Round 5. The fifth round examines the peer horizon scores for the 289 middle schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 40% of the middle schools this year. The monkey correctly predicted the status of 50% of this year’s middle schools.

Score: Monkey 5, DOE 0.

Round 6. The last round looks at the citywide horizon progress scores for the middle schools. The DOE’s citywide horizon progress scores from last year correctly predicted the progress status of 45% of this year’s middle schools. The monkey correctly predicted the status of 49% of this year’s middle schools.

Score: Monkey 6, DOE 0.

skoolboy will forgo the cheap jokes about how a monkey could do a better job of managing New York City’s accountability system than the people currently in charge. On the whole, they’re smart, hard-working people, and ridiculing them is not likely to persuade them to change their behavior (as satisfying as it may be at particular moments). But the system that they have designed and implemented is profoundly flawed, as this comical example illustrates, and it needs to change. eduwonkette and I are going to keep hammering on this point, because it has such important consequences for students and for schools.

And besides: I bet the DOE would beat the monkey in predicting school progress scores in math. (But it wouldn’t be a rout.)

September 23, 2008

What Does Educational Testing Really Tell Us? An Interview with Daniel Koretz

Daniel Koretz, a professor who teaches educational measurement at the Harvard Graduate School of Education, generously agreed to field a few questions about educational testing. He is the author of Measuring Up: What Educational Testing Really Tells Us.

EW: What are the three most common misconceptions about educational testing that Measuring Up hopes to debunk?

DK: There are so many that it is hard to choose, but given the importance of NCLB and other test-based accountability systems, I'd choose these:
* That test scores alone are sufficient to evaluate a teacher, a school, or an educational program.

* That you can trust the often very large gains in scores we are seeing on tests used to hold students accountable.

* That alignment is a cure-all - that more alignment is always better, and that alignment is enough to take care of problems like inflated scores.

EW: I'm intrigued by your third point about alignment. For example, we often hear that because state testing systems are directed towards a particular set of standards, we should primarily be concerned with student outcomes on tests aligned with those standards. This is the common refrain about a "test worth teaching to." What's missing from this argument?

DK: Up to a point, alignment is a clearly good thing: we want clarity about goals, and we want both instruction and assessment to focus on the goals deemed most important.

However, there are two flies in the ointment. The first is that the achievement tests we are concerned with, no matter how well aligned, are small samples from large domains of performance. That means that most of the domain, including much of the content and skills relevant to the standards, is necessarily omitted from the test. As I explain in Measuring Up, this is analogous to a political poll or any other survey, and it is not a big problem under low-stakes conditions. Under high-stakes conditions, however, there is a strong incentive to focus on the sampled content at the expense of the omitted material, which causes score inflation. Aligned tests are not exempt. Score inflation does not require that the test include poorly aligned content. Even if the test is right on target, inflation will occur if the accountability program leads people to deemphasize other material that is also important for the conclusions based on scores. And to make this concrete: some of the most serious examples of score inflation in the research literature were found in Kentucky's KIRIS system, which was a standards-based testing program.

The second problem is predictability. To prepare students in a way that inflates scores, you have to know something about the test that is coming this year, not just the ones you have seen in the past. The content, format, style, or scoring of the test has to be somewhat predictable. And, of course, it usually is, as anyone who has looked at tests and test preparation materials should know. Carried too far, alignment actually makes this problem worse, by focusing attention on the particular way that knowledge and skills are presented in a given set of standards. Think about 'power standards,' 'eligible standards,' and 'grade level expectations,' all of which can be labels for narrowing in on the specifics of how a set of skills appears on one state's particular assessment.

Why is this bad? Because many of those specifics are not relevant to the students' broader competence and long-term well-being. Scores on a test are a means to an end, not properly an end in themselves. Education should provide students knowledge and skills that they can use in later study and in the real world. Employers and university faculty will not do students the favor of recasting problems to align with the details of the state tests with which they are familiar. As Audrey Qualls said some years ago: real gains in achievement require that students can perform well when confronted with "unfamiliar particulars." Improving performance on the familiar but not the unfamiliar is score inflation.

EW: What are the implications of score inflation for both measuring and attenuating achievement gaps? Because schools serving disadvantaged students face more pressure to increase test scores via the mechanisms you describe, I worry that true achievement gaps may be unchanged - or even growing - while they appear to be closing based on high-stakes measures.

DK: I share your worry. I have long suspected that on average, inflation will be more severe in low-achieving schools, including those serving disadvantaged students. In most systems, including NCLB, these schools have to make the most rapid gains, but they also face unusually serious barriers to doing so. And in some cases, the size of the gains they are required to make exceeds by quite a margin what we know how to produce by legitimate means. This will increase the incentive to take short cuts, including those that will inflate scores. This would be ironic, given that one of the primary rationales for NCLB is to improve equity. Unfortunately, while we have a lot of anecdotal evidence suggesting that this is the case, we have very few serious empirical studies of this. We do have some, such as the RAND study that showed convincingly that the "Texas miracle" in the early 1990s, supposedly including a rapid narrowing of the achievement gap, was largely an illusion. Two of my students are currently working with me on a study of this in one large district, but we are months away from releasing a reviewed paper, and it is only one district.

I have argued for years that one of the most glaring faults of our current educational accountability systems is that we do not sufficiently evaluate their effects, instead trusting - evidence to the contrary - that any increase in scores is enough to let us declare success. We should be doing more evaluation not only because it is needed for the improvement of policy, but also because we have an ethical obligation to the children upon whom we are experimenting. Nowhere is this failure more important than in the case of disadvantaged students, who most need the help of education reform.

Inflation is not the only reason why we are not getting a clear picture of changes in the achievement gap. The other is our insistence on standards-based reporting. As I explain in Measuring Up, relying so much on this form of reporting has been a serious mistake for a number of reasons. One reason is that if one wants to compare change in two groups that start out at different levels - poor and wealthy kids, African American and white kids, whatever - changes in the percents above a standard will always give you the wrong answer. This particular statistic confuses the amount of progress a group makes with the proportion of the group clustered around that particular standard, and the latter has to be different for high- and low-scoring groups. I and others have shown that this distortion is a mathematical certainty, but perhaps most telling is a paper by Bob Linn that shows that if you ask whether the achievement gap has been closing, NAEP will give you different answers - very different answers - depending on whether you use changes in scale scores, changes in percent above Basic, or changes in percent above Proficient. This is not because the relative progress has been different at different levels of performance; it is simply an artifact of using percents above standards. This is only one of many problems with standards-based reporting, but in my opinion, it is by itself sufficient reason to return to other forms of reporting.
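A small numerical sketch may help show how percents above a standard can mislead. Everything in it is a hypothetical assumption for illustration: two groups with normally distributed scores one standard deviation apart, an identical gain of 0.2 standard deviations for both, and two cut scores that borrow the labels "Basic" and "Proficient". None of the numbers come from NAEP or any real test.

```python
from statistics import NormalDist

def pct_above(group_mean, cut, sd=1.0):
    """Percent of a normally distributed group scoring above a cut score."""
    return 100 * (1 - NormalDist(group_mean, sd).cdf(cut))

GAIN = 0.2  # both groups improve by the same amount on the score scale

def change(group_mean, cut):
    """Change in percent above the cut after the group gains GAIN."""
    return pct_above(group_mean + GAIN, cut) - pct_above(group_mean, cut)

groups = {"higher-scoring": 0.0, "lower-scoring": -1.0}   # hypothetical means
cuts = {"Proficient": 0.0, "Basic": -1.0}                 # hypothetical cuts

for cut_name, cut in cuts.items():
    for group_name, group_mean in groups.items():
        print(cut_name, group_name, round(change(group_mean, cut), 1))
```

Under these assumptions the gap in average scores does not change at all, since both groups gain the same amount. Yet the higher-scoring group shows the larger jump in percent above the "Proficient" cut, so the gap appears to widen, while the lower-scoring group shows the larger jump in percent above the "Basic" cut, so the gap appears to narrow. The difference comes entirely from how many students in each group sit near each cut.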

September 17, 2008

Between a Political Rock and a Statistical Hard Place

Some days, skoolboy feels bad for the hard-working folks in the New York City Department of Education. They’re caught between a political rock and a statistical hard place. The political rock is the New York State accountability system, which complies with No Child Left Behind’s requirements to test students annually in grades 3-8 in Mathematics and English Language Arts, and to classify students, based on their test scores, as either Not Meeting Learning Standards (Level I), Partially Meeting Learning Standards (Level II), Meeting Learning Standards (Level III), or Meeting Learning Standards with Distinction (Level IV), and then aggregate the performance of students, and subgroups of students, to assess the school’s progress toward the goal of 100% proficiency for all students by the year 2014. The mechanism for this is a series of grade-specific exams, with a broad (but arbitrary, as Dan Koretz explains in Measuring Up) standard-setting process that defines the scores on the exam that correspond to the four proficiency levels. Whatever a student’s scale score on the exam, he or she is classified into a particular proficiency level.

The statistical hard place is that the proficiency levels are only part of the story. The NYC DOE has found that the scale scores matter, such that a student whose scale score is halfway between the cutoffs for Level II and Level III, and therefore whose proficiency level is Level II, has a higher probability of graduating from high school on time than a student whose scale score is right at the cutoff for Level II. The scale scores have predictive validity—that is, they predict educational outcomes that we think of as important—but they don’t have the political currency of the proficiency levels specified by the state and the federal government.

There’s no evidence, to skoolboy’s knowledge, that achieving a proficiency level on NCLB-style exams has any predictive validity over and above the scale scores on which they are based. (Another regression discontinuity design study waiting to happen.) But I’ll wager that they don’t.

Whether or not the state/NCLB proficiency levels matter, the NYC DOE is stuck. They have to pay homage to the state standards, even though their internal evidence shows that partial progress—“learning quite a bit,” in skoolboy’s terms—really does matter for students’ futures, and therefore is something that schools should be held accountable for.

And I don’t disagree. I would be comfortable (though not ecstatic) with school progress reports that used changes in scale scores to quantify how much students had learned from one year to the next, under two conditions: (a) if the exams were vertically linked, and (b) if the uncertainty in the estimates of school-level effects on the average change were taken into account. Neither of these conditions is met in the current New York City School Progress Reports.

Navigating the political rock and the statistical hard place is definitely a challenge, both rhetorically and in the construction of the School Progress Reports. Rhetorically, the DOE is obliged to argue that a student who is Level III in fourth grade and Level II in fifth grade has lost ground—that student has fallen off of the sharp Level III cliff—because the state and federal accountability metrics treat this as a sharp discontinuity. But as a practical matter, the student may not have fallen off a cliff; rather, she may be just a little bit lower on a gradual hill in fifth grade than we’d like, but still higher on the hill than she was in fourth grade--and the DOE’s internal analyses document that anyone who is higher on the hill is better off than someone lower.

What’s the DOE to do? Well, it could continue to escalate the rhetoric directed toward its critics. (I note with alarm that the DOE went from calling me by my blogging name “skoolboy” on Monday to calling me “Professor Pallas of Teachers College” on Wednesday—whose proclivity to giving A’s to all of his students will come as a surprise to many of them—what’s next? Examining my teeth?) Or it could speak honestly and openly about the challenge of incorporating political and technical realities into the School Progress Reports. I think readers know which path skoolboy recommends.

Guest Blogger Daniel Koretz on New York City's Progress Reports

Daniel Koretz is a professor who teaches educational measurement at the Harvard Graduate School of Education. He is the author of Measuring Up: What Educational Testing Really Tells Us. Below, he weighs in on the NYC Progress Reports that were released yesterday.

eduwonkette: One of the key points of your book is that test scores alone are insufficient to evaluate a teacher, a school, or an educational program. Yesterday, the New York City Department of Education released its Progress Reports, which grade each school on an A-F scale. 60 percent of the grade is based on year-to-year growth and 25 percent is based on proficiency, so 85 percent of the grade is based on test scores. Do you have any advice to New Yorkers about how to use - or not to use - this information to make sense of how their schools are doing?

Koretz: This is a more complicated question in New York City than in many places because of the complexity of the Progress Reports. So let’s break this into two parts: first, what should people make of scores, including the scores New York released a few weeks ago, and second, what additional considerations should New Yorkers keep in mind in interpreting the Progress Reports?

In the ideal world, where tests are used appropriately, I give parents and others the same warning that people in the testing field have been offering (to little avail) for more than half a century: test scores give you a valuable but limited picture of how kids in a school perform. There are many important aspects of schooling that we do not measure with achievement tests, and even for the domains we do measure (say, mathematics) we test only part of what matters. And test scores only describe performance; they don’t explain it. Decades of research have repeatedly confirmed that many factors other than school quality, such as parental education, affect achievement and test scores. Therefore, schools can be either considerably better or considerably worse than their scores, taken alone, would suggest.

However, there is another complication: when educators are under intense pressure to raise scores, high scores and big increases in scores become suspect. Scores can become seriously inflated—that is, they can increase substantially more than actual student learning. This remains controversial in the education policy world, but it should not be, because the evidence is clear, and similar corruption of accountability measures has been found in a wide variety of different economic and policy areas (so widely that it goes by the name of “Campbell’s Law”). High scores or big gains can indicate either good news or inflation, and in the absence of other data, it is often not possible to distinguish one from the other. As you know, this was a big issue in New York City this year, in part because some of the gains, such as the increase in the proportion at Levels 3-4 in 8th grade math, were remarkably large.

New York City is a special case. It is always necessary to reduce the array of data from a test to some sort of indicators, and NYC has developed its own, called the Progress Reports, which assign schools one of five grades, A through F. My advice to New Yorkers is to pay attention to the information that goes into creating the Progress Reports but to ignore the letter grades and to push for improvements to the evaluation system.

The method for creating Progress Reports is baroque, and it is hard to pick which issues to highlight in a short space. The biggest problems, in my opinion, lie in the estimation of student progress, which constitutes 60% of the grade. The basic idea is that a student’s performance on this year’s test is compared to her performance in the previous grade, and the school gets credit for the change. It sounds simple and logical, but the devil is in the details. (For a non-technical overview of the issues in using value-added models to evaluate teachers and schools, see “A Measured Approach”.)

To keep this reasonably brief, I’ll focus on three problems. First, the tests are not appropriate for this purpose. skoolboy made reference to part of this problem in a posting on your blog. To be used this way, tests in adjacent grades should be constructed in specific ways, and the results have to be placed on a single scale (a process called vertical linking). Otherwise, one has no way of knowing whether, for example, a student who gets the same score in grades 4 and 5 improved, lost ground, or treaded water. The tests used in New York were not constructed for this purpose, and the scale that NYC has layered on top of the system for this purpose is not up to the task.

And that points to the second problem, which again skoolboy noted: the entire system hinges on the assumption that one unit of progress by student A means the same amount of improvement in learning as one unit by student B. This is what is called technically an interval scale, meaning that a given interval or difference means the same thing at any level. Temperature is an interval scale: the change from 40 to 50 degrees signifies the same increase in energy as the change from 150 to 160. There is no reason to believe that the scale used in the Progress Reports is even a reasonable approximation to an interval scale. It starts with the performance standards, which are themselves arbitrary divisions and cannot be assumed to be equal distances apart. The NYC system assigns to these standards new scores that nonetheless assume that the standards are equidistant—so, for example, a school gets the same credit for moving a student from Level 1 to Level 2 as for moving a student from Level 2 to Level 3. Moreover, the NYC system assumes that a student who maintains the same level on this scale has made “a year’s worth of progress.” That assumption is also unwarranted, because standards are set separately by grade, and there is no reason to believe that a given standard, say, Level 3, means a comparable level of performance in adjacent grades. (There is in fact some evidence to the contrary.)
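A toy illustration of why the interval-scale assumption matters. The scores and the rescaling function below are made up for the example and do not come from the Progress Reports. The sketch shows that when a scale is only ordinal, an equally defensible relabeling of the same rank order can reverse which student appears to have gained more.

```python
def gain(before, after, rescale=lambda score: score):
    """Difference between two scores after applying an order-preserving rescaling."""
    return rescale(after) - rescale(before)


def stretch_top(score):
    """Hypothetical rescaling: keeps every student's rank order, widens the upper range."""
    return score ** 2 / 100.0


# (last year's score, this year's score) for two hypothetical students
student_a = (20, 30)
student_b = (60, 68)

raw_a, raw_b = gain(*student_a), gain(*student_b)
new_a = gain(*student_a, rescale=stretch_top)
new_b = gain(*student_b, rescale=stretch_top)

print("A gains more on the raw scale:", raw_a > raw_b)
print("A gains more on the rescaled scale:", new_a > new_b)
```

On the raw scale student A shows the larger gain; after the rescaling, student B does. Unless there is evidence that one of the two scales has equal intervals, neither ranking of the gains can be preferred.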

The result is that there is no reason at all to trust that two equally effective schools, one serving higher-achieving students than the other, will get similar Progress Report grades. Moreover, even within a school, two students who are in fact making identical progress may seem quite different by the city’s measure. There may be reasons for policymakers to give more credit for progress with some students than for progress with others, but if one does that, one no longer has a straightforward, comparable measure of student progress.

And finally, there is the problem of error. People working on value-added models have warned for years that the results from a single year are highly error-prone, particularly for small groups. That seems to be exactly what the NYC results show: far more instability from one year to the next than could credibly reflect true changes in performance. Mayor Bloomberg was quoted in the New York Times on September 17 as saying, “Not a single school failed again. That’s exactly the reason to have grades…It’s working.” This optimistic interpretation does not seem warranted to me. The graph below shows the 2008 letter grades of all schools that received a grade of F in 2007. It strains credulity to believe that if these schools were really “failing” last year, three-fourths of them improved so markedly in a mere 12 months that they deserve grades of A or B. (The proportion of 2007 A schools that remained As was much higher, about 57 percent, but that was partly because grades overall increased sharply.) This instability is sampling error and measurement error at work. It does not make sense for parents to choose schools, or for policymakers to praise or berate schools, for a rating that is so strongly influenced by error.
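A rough simulation of how error alone can produce this kind of churn. Everything here is an assumption made for illustration: true school quality is held fixed across both years, each year's observed score is true quality plus independent random noise, and an F is simply the bottom 4% of observed scores. This is not the DOE's grading formula.

```python
import random


def repeat_f_share(n_schools=1000, noise_sd=1.0, f_fraction=0.04, seed=0):
    """Share of year-1 'F' schools that are also 'F' in year 2 when true quality never changes."""
    rng = random.Random(seed)
    true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_schools)]
    n_f = max(1, round(n_schools * f_fraction))

    def observed_year():
        # Observed score = unchanged true quality + fresh noise for that year
        return [q + rng.gauss(0.0, noise_sd) for q in true_quality]

    def bottom_schools(scores):
        order = sorted(range(n_schools), key=lambda i: scores[i])
        return set(order[:n_f])

    f_year1 = bottom_schools(observed_year())
    f_year2 = bottom_schools(observed_year())
    return len(f_year1 & f_year2) / len(f_year1)


no_error = repeat_f_share(noise_sd=0.0)    # scores measured perfectly
with_error = repeat_f_share(noise_sd=1.0)  # noise as large as the true differences
print(no_error, with_error)
```

With no noise, every F school repeats, because nothing about the schools changed. Once noise is added, many of the year-1 F schools escape the bottom group in year 2 even though their true quality is identical in both years.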

We should give NYC its due. The Progress Reports are commendable in two respects: considering non-test measures of school climate, and trying to focus on growth. Unfortunately, the former get very little weight, and the growth measures are not yet ready for prime time.

2008 Letter Grades of Schools that Received an F Grade in 2007

NYC%20F%20schools.png

September 14, 2008

Let the Spin Begin

top.gif

Suppose that your fourth-grader takes a state test that shows that she understands the associative property of multiplication, can multiply two-digit numbers by two-digit numbers, and can find the perimeter of a polygon by adding up the length of the sides. A year later, as a fifth-grader, she takes a test that shows that she can compare fractions and decimals using <, > or =; identify the factors of a given number; simplify fractions to their lowest terms; and knows that the sum of the interior angles of a quadrilateral is 360 degrees—but she cannot yet create algebraic or geometric patterns using concrete objects or visual drawings (e.g., rotate and shade geometric shapes). Would you say that your child had lost ground in proficiency, or actually gone backward?

Jim Liebman would. Liebman, the Columbia University law professor on leave as Chief Accountability Officer at the New York City Department of Education, is quoted and paraphrased in an article by Jim Dwyer in Saturday’s New York Times on the F grade that P.S. 8 in Brooklyn Heights will receive in this year’s School Progress Reports—a grade that many are finding hard to believe, given that 80% of the students tested in the school are judged proficient in math, and two-thirds are judged proficient in English Language Arts. Doubly embarrassing, in that Chancellor Joel Klein and Mayor Mike Bloomberg have publicly declared the school to be successful and worthy of emulation.

So the spinmeisters are out, and the spin here is justifying the grade of F by arguing that the children in P.S. 8 are going backward. “You drop them off at the beginning of the year, and on average, by the end of the year, your child lost ground in proficiency,” Dwyer quotes Liebman as saying. “Where was the child last year, and where is the child this year?” Liebman asked. “You’re comparing them to themselves.”

A gentle reminder to Mr. Liebman, who was hired in January 2006: the state math and ELA tests that children take, and that are the primary basis for assigning these lovely letter grades, are not vertically equated. (See skoolboy's testing primer here.) This means that there is no basis for comparing performance on the fourth-grade test with performance on the fifth-grade test. For each test, there is a subjective judgment about what level of performance constitutes proficiency, but the tests are independent. There is no basis for claiming that children are going backward; there’s no justification for claiming that a child “lost ground in proficiency,” since proficiency doesn’t exist in the abstract, but rather in grade-specific skills; and the children are not being compared to themselves, but rather their location in the distribution of children’s performance in one year is being compared to their location in the distribution of children’s performance the following year.

Perhaps Jim Liebman simply misspoke, as perhaps did Chancellor Joel Klein when he referred to statistical significance as “playing something of a game.” Such missteps might arise from the tremendous pressure to justify a particular high-stakes evaluation of a school when there are multiple sources of information about school performance that point in different directions—NCLB status, achievement levels, gains, school quality reviews, not to mention the public pronouncements of Liebman’s boss, and his boss’s boss.

There’s nothing wrong, in skoolboy’s view, in looking at students’ achievement growth as one of several criteria for judging how well a school is doing in relation to other schools. But I would never think of using year-to-year changes in proficiency levels on just two tests as the primary basis for evaluating a school’s performance. And neither would most people who study testing and assessment for a living.

September 12, 2008

Cool People You Should Know: Doug Downey

Doug-Downey.jpg
To many observers of public education, there is no doubt about which schools are failing - it's the schools with low rates of students passing state tests, stupid!

Of course, this assumes that students' achievement is a direct measure of school quality. "Yet we know that this assumption is wrong....It follows that a valid system of school evaluation must separate school effects from nonschool effects on children's achievement and learning" writes Doug Downey, a cool Ohio State sociologist of education you should know, in his recent paper (in collaboration with Paul von Hippel and Melanie Hughes), "Are 'Failing' Schools Really Failing?"

Analyzing data from the Early Childhood Longitudinal Study - Kindergarten Cohort, a national sample of 21,000 kindergarteners that were then followed through 5th grade, Downey and colleagues thus set out to isolate the effects of schools on student learning. The ECLS data are uniquely suited for this task because the study evaluated students in the fall and spring of kindergarten, and again in the fall and spring of first grade. It turns out that summers - a time when students are only affected by non-school influences - are the key to teasing apart school and nonschool factors.

Downey and colleagues look at schools' effectiveness in four different ways. First, they examine NCLB's method - overall test score levels. They then turn to 12-month learning rates; think growth models, which measure test score growth, for example, between a test given in April 2007 and a test given in April 2008. They contrast those rates with 9-month learning rates; imagine a test given in September, and then again in May. Finally, they introduce a measure called impact, which is the difference between the school year and summer learning rate.
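A minimal sketch of the last three measures, using invented scores for two hypothetical schools. The 9-month school year, the 3-month summer, and the function and variable names are assumptions made for the example; the paper's own estimation is more involved.

```python
def learning_measures(spring_k, fall_1, spring_1, school_months=9.0, summer_months=3.0):
    """Monthly learning rates from three average test scores for one school.

    spring_k: spring of kindergarten, fall_1: fall of first grade,
    spring_1: spring of first grade.
    """
    school_rate = (spring_1 - fall_1) / school_months    # 9-month (school-year) rate
    summer_rate = (fall_1 - spring_k) / summer_months    # summer rate, non-school influences only
    twelve_month_rate = (spring_1 - spring_k) / (school_months + summer_months)
    impact = school_rate - summer_rate                   # school-year rate minus summer rate
    return {
        "school_rate": school_rate,
        "summer_rate": summer_rate,
        "twelve_month_rate": twelve_month_rate,
        "impact": impact,
    }


# Hypothetical School A: students keep learning over the summer
school_a = learning_measures(spring_k=40, fall_1=43, spring_1=61)
# Hypothetical School B: students lose ground over the summer
school_b = learning_measures(spring_k=30, fall_1=27, spring_1=45)

print(school_a)
print(school_b)
```

In this made-up case both schools teach at the same rate during the school year, School A looks better on achievement level and on 12-month growth, and School B has the higher impact because its students learn so much faster in school than out of it.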

"Impact" is attractive because it doesn't require us to measure and statistically control for all of the different aspects of children's nonschool environments that may affect school success, as do cardiac surgery report cards. It captures what we need to know about students' out-of-school environments without bogging us down in the methodological and political problems associated with introducing these controls. And it helps us adjust for "soft" factors like innate student motivation, which are difficult to measure and control. Moreover, it holds schools harmless for what happens to their students over the summer, which currently serves as a confounding factor in growth models.

What percent of schools performing in the bottom 20% of overall achievement are actually in the bottom 20% for measures of impact and learning? Less than half! High-achieving schools are concentrated in more affluent communities, but "high impact" schools exist across the socioeconomic spectrum. And the opposite is true. There are plenty of schools with good test scores that are skating by simply because they had advantaged kids to begin with.

What does this all mean for NCLB? Downey and colleagues put it like this:
Our results raise serious concerns about the current methods that are used to hold schools accountable for their students' achievement levels. Because achievement-based evaluation is biased against schools that serve the disadvantaged, evaluating schools on the basis of achievement may actually undermine the NCLB goal of reducing racial/ethnic and socioeconomic gaps in performance. If schools that serve the disadvantaged are evaluated on a biased scale, their teachers and administrators may respond like workers in other industries when they are evaluated unfairly - with frustration, reduced effort, and attrition. Under a fair system, a school's chances of receiving a high mark should not depend on the kinds of students the school happens to serve.
Crystal clear, creative thinking is the distinguishing feature of Downey's work - see, for example, his paper on school effects on child obesity, or his paper asking if schools are "the great equalizer."

Wonks can rest a little easier tonight with the knowledge that Downey's now turned his attention to NCLB.

September 9, 2008

Lessons for No Child Left Behind from "No Cardiac Surgery Patient Left Behind"

heart_art.jpg
New AYP numbers are out, folks. In California, only 48% of schools made AYP, and only 34% of middle schools did so. In Missouri, only about 40% of schools made AYP. Pick almost any state, and you'll see that there are soaring numbers of schools designated as "in need of improvement." With numbers like these, it's worth considering whether NCLB's measurement apparatus is accurately identifying "failing schools."

One way to get leverage on this question is to consider how other fields approach the issue of accountability. Doctor and hospital accountability for cardiac surgery - also the topic of a NYT commentary today - is instructive in this regard. Borrowing heavily from previous work, let me outline how state governments have approached doctor and hospital accountability in medicine. In subsequent posts this week, I'll write about the outcomes of medical accountability systems, as well as some of their unintended consequences.

Medicine makes use of what is known as “risk adjustment” to evaluate hospitals’ performance. Since the early 1990s, states have rated hospitals performing cardiac surgery in annual report cards. The idea is essentially the same as using test scores to evaluate schools’ performance. But rather than reporting hospitals’ raw mortality rates, states “risk adjust” these numbers to take patient severity into account. The idea is that hospitals caring for sicker patients should not be penalized because their patients were sicker to begin with.

In practice, what risk adjustment means is that mortality is predicted as a function of dozens of patient characteristics. These include a laundry list of medical conditions out of the hospital’s control that could affect a patient’s outcomes: the patient’s other health conditions, demographic factors, lifestyle choices (such as smoking), and disease severity. This prediction equation yields an “expected mortality rate”: the mortality rate that would be expected given the mix of patients treated at the hospital.

While the statistical methods vary from state to state, the crux of risk adjustment is a comparison of expected and observed mortality rates. In hospitals where the observed mortality rate exceeds the expected rate, patients fared worse than they should have. These “adjusted mortality rates” are then used to make apples-to-apples comparisons of hospital performance.
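A simplified sketch of the observed-versus-expected comparison. The per-patient risks are taken as given here (in practice they would come from a prediction model fitted to patient characteristics), the numbers are invented, and multiplying the observed/expected ratio by a statewide rate is one common convention rather than a description of any particular state's method.

```python
def risk_adjusted_rate(deaths, predicted_risks, statewide_rate):
    """Return (observed rate, expected rate, risk-adjusted rate) for one hospital.

    predicted_risks holds one predicted probability of death per patient treated.
    """
    n_patients = len(predicted_risks)
    observed = deaths / n_patients
    expected = sum(predicted_risks) / n_patients
    adjusted = observed / expected * statewide_rate
    return observed, expected, adjusted


STATEWIDE_RATE = 0.03  # hypothetical statewide mortality rate

# Hypothetical Hospital A: 100 relatively healthy patients, 2 deaths
hospital_a = risk_adjusted_rate(2, [0.02] * 100, STATEWIDE_RATE)
# Hypothetical Hospital B: 100 much sicker patients, 4 deaths
hospital_b = risk_adjusted_rate(4, [0.06] * 100, STATEWIDE_RATE)

print(hospital_a)
print(hospital_b)
```

Hospital B has twice the raw mortality of Hospital A, yet its patients died less often than their risk profile predicted, so its adjusted rate comes out lower.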

Accountability systems in medicine go even further to reduce the chance that a good hospital is unfairly labeled. Hospitals vary widely in size, for example, and in small hospitals a few aberrant cases can significantly distort the mortality rate. So, in addition to the adjusted mortality rate, confidence intervals are reported to illustrate the uncertainty that stems from these differences in size. Only when these confidence intervals are taken into account are performance comparisons made between hospitals.
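To see how size drives uncertainty, here is a sketch comparing two hypothetical hospitals with the same 4% observed mortality. The Wilson score interval is used purely as an illustration; it is an assumption of this example, not necessarily the interval any state report card uses.

```python
import math


def wilson_interval(deaths, patients, z=1.96):
    """Approximate 95% Wilson score interval for a mortality proportion."""
    p_hat = deaths / patients
    denom = 1.0 + z ** 2 / patients
    center = (p_hat + z ** 2 / (2 * patients)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / patients + z ** 2 / (4 * patients ** 2)
    ) / denom
    return center - half_width, center + half_width


small_hospital = wilson_interval(deaths=2, patients=50)      # 4% observed mortality
large_hospital = wilson_interval(deaths=80, patients=2000)   # also 4% observed mortality

print(small_hospital)
print(large_hospital)
```

The small hospital's interval runs from roughly 1% to 13%, while the large hospital's runs from roughly 3% to 5%. Identical point estimates carry very different amounts of information, which is why comparisons are made only after the intervals are taken into account.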

Contrast this approach with that used by the New York City Department of Education's progress reports, where "point estimates" are used to array schools on an A-F continuum with no regard for measurement error. Readers know well that your friendly neighborhood "statistical nut" has no beef with the use of sophisticated statistical methods to compare schools. But I would just ask that we have some humility about what these methods can and cannot do. (Sidenote: The only winners when we ignore these issues are educational researchers, who can then write regression discontinuity papers using these data. Thanks for the publications, Joel and Mike!)

And it's quite eye-opening to compare the language used by state and federal governments to explain their accountability systems with the rhetoric we hear in education. Consider this statement from the Department of Health and Human Services to explain the rationale behind risk adjustment:
The characteristics that Medicare patients bring with them when they arrive at a hospital with a heart attack or heart failure are not under the control of the hospital. However, some patient characteristics may make death more likely (increase the ‘risk’ of death), no matter where the patient is treated or how good the care is. … Therefore, when mortality rates are calculated for each hospital for a 12-month period, they are adjusted based on the unique mix of patients that hospital treated.
If you replace the word "hospital" with "school" above, you can imagine the reception this statement would receive in the educational accountability debate. Soft bigotry of low expectations, and you probably kill baby seals for fun, too.

Readers, why is the educational debate so different? Full disclosure: I will shamelessly appropriate your thoughts in my dissertation, which attempts to answer this question, and also establish the effects of each of these systems on race, gender, and socioeconomic inequalities in educational and health outcomes.

September 7, 2008

Predicting the Near Future*

question_marks.jpg

Sometime soon, with great fanfare, the New York City Department of Education will release this year’s School Progress Reports. (Word on the street is that schools already know their grades.) The School Progress Reports, for better or worse, are the centerpiece of the NYC accountability system. (skoolboy thinks for worse, but more on that later.)

The DOE has made a number of changes to the Progress Reports for this second iteration, and I think that eduwonkette had something to do with that (as did other critics and analysts outside of the Tweed inner circle). We can expect to see separate letter grades for the three major dimensions on which the Progress Reports are based: school environment (including attendance, and parent, teacher and student surveys), student performance, and student progress. But the overall format appears to be unchanged: most of the grade is based on student progress on test scores, and such gains are not very reliable from one year to the next. There is, in skoolboy’s opinion, a false sense of precision conveyed by these letter grades, as they are based on components that are measured with error, but that measurement error is not reflected in how the grades are calculated. And I’m particularly annoyed at the misuse of social surveys for accountability purposes.

Nevertheless, the DOE is marching onward, and we’ll have this year’s grades to pore over in the near future. (And you can bet that eduwonkette will put on the green eyeshade for this, even though it clashes with her cape and mask.) How many schools will improve their grade from last year to this year? How many will fall? It’s time to make some predictions. What do you think, readers?

Here's a five-by-five table designed to show how this year’s grades are associated with last year’s grade. Each column represents last year’s grade, and each row represents a possible outcome for this year. The column percentages will add up to 100%. Try to fill in the blanks: What percentage of the schools that received A’s last year will receive an A this year? What percentage of A’s will decline to B’s? What fraction will fall further to C’s, D’s, and F’s? At the other end of the spectrum, what percentage of last year’s F’s will remain F’s? What percentage will climb out of the cellar to obtain a D? Will any make the leap from F to A?
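For readers who want to check an entry once the grades are out, here is a sketch of how such a table of column percentages could be built. The handful of grade pairs is invented toy data, and the function name is an assumption made for the example.

```python
from collections import Counter

GRADES = ["A", "B", "C", "D", "F"]


def column_percentages(grade_pairs):
    """Build a table of this-year grades (rows) by last-year grades (columns).

    grade_pairs is a list of (last_year, this_year) tuples, one per school.
    Each column is expressed as percentages of that column's schools.
    """
    cell_counts = Counter(grade_pairs)
    column_totals = Counter(last_year for last_year, _ in grade_pairs)
    table = {}
    for this_year in GRADES:
        table[this_year] = {}
        for last_year in GRADES:
            total = column_totals[last_year]
            count = cell_counts[(last_year, this_year)]
            table[this_year][last_year] = 100.0 * count / total if total else 0.0
    return table


# Toy data: four schools that had an F last year, four that had an A
toy_pairs = (
    [("F", "A")] * 2 + [("F", "B")] + [("F", "F")]
    + [("A", "A")] * 3 + [("A", "B")]
)
table = column_percentages(toy_pairs)

print("last year: " + "  ".join(f"{g:>5}" for g in GRADES))
for this_year in GRADES:
    print(f"{this_year:>9}: " + "  ".join(f"{table[this_year][g]:5.1f}" for g in GRADES))
```

Each column that contains at least one school sums to 100%, so a column describes where last year's schools with that grade ended up this year.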

crosstab.JPG

As a reminder, last year, about 23% of schools received an A; 38% received a B; 26% received a C; 8% received a D; and 4% (i.e., 53 schools) received an F.

A caveat: The DOE knows that the legitimacy of the School Progress Reports depends on the grades not being too volatile from year to year. If 75% of last year’s A’s became F’s this year, no one would take this scheme seriously. (And if schools that everyone views as exemplary or high-performing got middling grades, this too would call the scheme’s legitimacy into question. So don't expect Stuyvesant High School to get a C.) There may not be very much fluctuation from last year to this. You can be sure that the DOE has constructed this year’s scores so that there’s not too much instability from last year to this year.

But since we believe in incentives on this blog, the reader who comes closest to the actual association between last year and this year shall receive a prize to be selected by eduwonkette—and we know how creative she can be. Be sure to fill in all 25 blanks.

*Employees of Tweed Courthouse, KPMG Consulting, and the Parthenon Group are ineligible for this contest.

August 27, 2008

Guest Blogger Bruce Fuller: The Benefits and Dilemmas of Centralized Accountability

Bruce Fuller, sociologist and professor of education and public policy at the University of California - Berkeley, has co-edited a new book, Strong States, Weak Schools: The Benefits and Dilemmas of Centralized Accountability. Below, he provides a Q&A on the book’s findings.

Q. Media reports summed up your findings by saying that teacher responses to the No Child Left Behind Act and state accountability efforts have been “haphazard”, and teachers are feeling demoralized. Didn’t we know this already?

A. We do know that teacher associations are eager to revamp No Child following the November elections, and even recraft Washington’s role in education. And the Bush Administration, business groups, and some civil rights advocates claim that No Child is working.

The seven research teams that came together to produce Strong States, Weak Schools set the stage by first showing that student achievement has inched up at a glacial pace since No Child was enacted in 2002, even slowing progress observed in the 1990s, as state-led accountability and school finance reforms were successfully pursued. Progress is more discernible in certain states.

But few researchers have hung out in schools, interviewed teachers and principals, and asked how front-line educators interpret new accountability regimes. This includes how teachers try to address state curricular standards, how they might use more textured data on what students are learning (or not), and the extent to which principals (and their district superintendents) motivate their teachers to focus on improving their pedagogies.

Earlier ethnographic studies tended to be conducted by scholars with a priori agendas, hoping to detail how teachers feel overly controlled by accountability measures, or how teachers held deep affection for them. Instead, our seven contributing teams probed different parts of the implementation elephant. Do front-line educators in elementary versus secondary schools hold different viewpoints? Do exit exams prompt different responses inside our high schools? Do the rules and tools of accountability programs operate differently to boost average student achievement, in contrast to factors that narrow racial gaps inside schools?

Q. So, does teacher resistance to top-down accountability programs help to explain the tepid gains in student test scores?

All seven teams found that teachers and principals have redoubled their efforts to assist low-performing students, in part because of accountability programs advanced from either state capitals or Washington. The spotlight placed on how student subgroups are doing, the availability of richer data on individual student competencies, and the threat of sanctions are motivating teachers to buckle down and collaborate to devise new pedagogical approaches and build stronger relationships with students.

Yet two factors constrain whether teacher responses are coordinated and effective over time. First, the RAND study, led by Laura Hamilton, found that the attention that teachers pay to curricular standards, whether they study student data, and the value they place on accountability pressures vary enormously within schools. The good news is that teachers in poor communities are not more or less responsive to accountability rules and tools, compared to those in middle-class neighborhoods. The bad news is that teacher responses are highly variable and eclectic within schools. This suggests that relatively few principals motivate their staff to pull in the same direction and employ new training and data tools that accountability programs often support.

Second, the uneven leadership of district superintendents and the stickiness of school institutions – especially high schools – tend to disempower principals. Tom Luschei and Gayle Christensen probed deep into these dynamics, hanging out over time in a few districts. They found that district leaders often respond to accountability demands in ritualized fashion, failing to work intensively with their principals to mobilize rules and tools. Two studies of high school responses, appearing in Strong States, Weak Schools, detail how growth targets, program improvement triggers, and exit exams turn teacher attention to low-achieving adolescents. But these individual-level responses rarely lead to innovative structural change in balkanized high schools.

Q. What is working to motivate teachers and raise student achievement, then?

Two studies in the book offer insights here: Melissa Henne and Heeju Jang examined what worked in 111 California elementary schools as they variably succeeded in closing achievement gaps between Anglo and Latino students. They show that disparities narrow when teachers report that their principal motivates staff to focus on raising achievement and delivers tools that make everyone feel efficacious. This is not simply a mechanical process: more equitable schools have teachers who report strong, respectful relationships with their principal and colleagues.

And Soung Bae went deeper into a California school district that had narrowed ethnic achievement gaps over time. She discovered district leaders who banked heavily on inservice teacher training – hammering on state curricular standards and inventive pedagogies. Then, district staff followed teachers back into their classrooms to provide ample clinical follow-up.

Q. So, what do these implementation studies say to state and federal policy makers who will soon be debating changes in accountability programs?

Pay attention to what motivates teachers, who, like other professionals, seem eager to pursue shared goals if they are trusted to improve their craft. The link between district staff and principals appears to be key. If district leaders are simply messengers of government – with little agility in adapting to rules and mobilizing tools – then their principals will have less capacity to motivate their teachers.

Teachers do report enormous dissatisfaction, at least in California, Georgia, and Pennsylvania, in being forced to ignore certain subjects and topics if they do not appear on state tests. Somehow, policy makers must face the sharp-edged dilemma of simplifying tests and the curriculum, while recognizing that tying the hands of teachers may erode everyone’s motivation.

All seven empirical studies can be viewed here.

August 21, 2008

Cool People You Should Know: David Figlio

David-Figlio-Card.jpg
Economist David Figlio, who has extensively studied the intended and unintended consequences of accountability systems, recently made a move from the University of Florida over to Northwestern. Figlio has a knack for the creative - but still substantive - paper: for example, see his papers on the unintended consequences of accountability systems including Food for Thought? The Effects of School Accountability Plans on School Nutrition, Accountability, Ability, and Disability: Gaming the System?, and Testing, Crime, and Punishment. More recently, he mounted an impressive survey of Florida principals to identify their responses to accountability pressures. (See Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Programs.)

In our chat on testing and accountability on Tuesday, Figlio provided a terrific overview of the accountability literature in response to Sherman Dorn's question, which is worth reprinting in full here:
I think that the evidence is becoming clearer that many of the hopes of high-stakes accountability advocates and many of the fears of high-stakes accountability critics are correct -- school administrators and teachers can and do respond to accountability pressures, at least at the margins.

A number of recent studies have shown that schools subject to greater accountability pressure tend to improve student test performance in reading and mathematics to a meaningful degree -- my recent study of Florida with Cecilia Rouse, Jane Hannaway and Dan Goldhaber (working paper on the website of the National Center for the Analysis of Longitudinal Data in Education Research, or caldercenter.org), for instance, suggests test score gains of one-tenth of a standard deviation in reading and math associated with a school getting an "F" grade relative to a "D" grade. We find that these test score gains persist for several years after the student leaves the affected school. Jonah Rockoff of Columbia University has a new working paper studying New York City's rollout of school grades that suggests that responses to grading pressure seem to happen immediately -- grades released in November were manifested in test score changes in the same winter/spring.

In the case of my study with Rouse, Hannaway and Goldhaber, we try to look inside the "black box" by studying a wide variety of potentially productive school responses, and it appears that Florida schools responded to accountability pressures by changing some of their instructional policies and practices, rather than "gaming the system."

The rapid and apparently productive response of school personnel to school accountability pressure suggests that educators are, at least to some degree "magisters economici," responding to the incentives associated with the system. And this makes getting the system right so important, because if schools and teachers respond quickly to incentives, the incentives had better be what society/policymakers want.

Many people raise concerns about teaching to the test, and there is certainly evidence of this -- consistently, estimated effects of accountability on high-stakes tests are larger than those on low-stakes tests -- though the low-stakes test results tend to be meaningful still, especially with respect to math. Harder to get a handle on is the narrowing of the curriculum to concentrate on the measured subjects; there is a lot of suggestive evidence that this is taking place to a small degree at the elementary level, though studies of the effects of accountability on performance on low-stakes subjects typically don't find that performance on these subjects suffers -- but of course, those subjects are still being measured with tests. Still there is certainly the incentive to reduce focus on "low-stakes" subjects. One possible solution for those concerned about low-stakes subjects being given short shrift would be to impose requirements such as minimum time spent on instruction or portfolio reviews.

There is a lot of evidence that accountability systems can have unintended consequences that are predicted by the magister economicus model. Derek Neal and Diane Whitmore Schanzenbach at the University of Chicago note that accountability systems based on getting students above a given performance threshold tend to induce schools to focus on the kids on the "bubble." I've found that that type of system may lead schools to employ selective discipline in an apparent attempt to shape the testing pool, or even to utilize the school meals program to artificially boost student test performance by "carbo-loading" students for peak short-term brain activity. These types of unintended consequences are much more likely in accountability systems based on the "status" model of getting students above a proficiency threshold, rather than the "gains" model of evaluating schools based on how much these students gain.

But there's a tradeoff here. The more we evaluate schools based on test score gains, where gaming incentives are lower, the more the focus is taken off of poorly-performing students whom society/policymakers would like to see attain proficiency. How the system is designed is crucially important.
You can find the transcript for the chat on testing and accountability here.

August 15, 2008

Join a Chat about Testing and Accountability in the NCLB Era: Tuesday, August 19th, 3-4pm

On Tuesday, David Figlio - an economist who does great work on the intended and unintended consequences of accountability systems - and I will chat with Ed Week readers about testing and accountability. The event description is below, and you can submit questions here:
Raising student achievement has long been a major issue in the American public education system. But with the advent of the No Child Left Behind Act and its testing mandates, even more attention has been directed towards this issue. As states release their annual school report cards, testing and accountability have once again emerged as hot topics of debate, with New York City Public Schools receiving considerable scrutiny of late.

Consequently, many observers have questioned whether state testing and accountability systems are accurately depicting student performance and the size of the achievement gap between groups.

July 16, 2008

The Vision Vacuum

"You're too young to be this cynical," he said, staring across his desk at me with a perplexed half smile.

I was 10, and in the middle of our classroom's simulated presidential campaign in which we followed the election and voted for candidates, my 5th grade teacher had launched into a pep talk about the potential for real change.

The last eight years have done little to temper my built-in skepticism. These are dark times, Diane Ravitch reminds us this morning. If I saw the glass as half-empty when Bush assumed the presidency, I now see it as half full - with poison.

That spin has taken over education policymaking hasn't helped. Accountability, as we used to talk about it back in the 1990s, was a way to evaluate reforms and provide incentives to implement them. It was never intended to be the reform. Now everyone's being tested and rated and graded and held accountable, but no one is supporting schools to improve the day-to-day work of teaching and learning. Policymakers say they want to leave "no child behind," but are willing to deny them health care in their next breath. We've adopted every technocratic solution that newly minted MBAs can come up with, but we have no educational vision.

So it was with cautious optimism that I followed Randi Weingarten's acceptance of the AFT presidency on Monday. As Dan Brown articulates in this post, she's a fighter, and one at the forefront of critiquing our current reform movement's easy slogans: "Too often, testing has replaced instruction; data has replaced professional judgment; compliance has replaced excellence; and so-called leadership has replaced teacher professionalism."

In her acceptance speech, which was bold and unapologetic, she embraced the proposed reforms of the Bolder and Broader coalition, and let us imagine what an alternate educational vision for public schools could look like. Watch the whole speech and let me know what you think - or just take a look at the clip below.

July 2, 2008

Educational Testing: A Brief Glossary

While you’re waiting for Dan Koretz’ book on testing to arrive – I think eduwonkette and I should get some kind of consideration for shilling for this book so often here – here’s a brief skoolboy’s-eye view on testing. Actual psychometricians are welcome to correct what I have to say.

Tests are typically designed to compare the performance of students (whether as individuals, or as members of a group) either to an external standard for performance or to one another. Tests that compare students to an external standard are called criterion-referenced tests; those that compare students to one another are called norm-referenced tests. Even though criterion-referenced tests are intended to hold students’ performance up to an external standard, there is often a strong temptation to compare the performance of individual students and groups of students on such tests, as if they were norm-referenced.

A typical standardized test of academic performance will have a series of items to which students respond, generally in either a multiple-choice format or a constructed-response format, in which students construct their own response to the item rather than selecting one from a list. There’s usually only one right answer to a multiple-choice item, whereas constructed-response items may be scored so that students get partial credit if they demonstrate partial mastery of the skill or competency that the item is intended to represent. For any test-taker, we can add up the number of right answers, plus the scores on the constructed-response items, to derive the student’s raw score on the test. A test with 45 multiple-choice items would have raw scores ranging from 0 to 45.
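
For readers who like to see the arithmetic, here is a minimal sketch of the raw-score calculation described above. The answer key, the student's responses, and the constructed-response points are all made up for illustration; nothing here comes from a real test.

```python
def raw_score(mc_responses, answer_key, cr_points):
    """Raw score = number of correct multiple-choice answers
    plus the points earned on constructed-response items."""
    # Count multiple-choice items where the response matches the key.
    mc_correct = sum(
        1 for given, correct in zip(mc_responses, answer_key) if given == correct
    )
    # Constructed-response items may carry partial credit, so just add the points.
    return mc_correct + sum(cr_points)


# Hypothetical five-item multiple-choice section plus two constructed-response items.
answer_key = ["B", "D", "A", "C", "B"]
student_responses = ["B", "D", "C", "C", "A"]  # items 3 and 5 are wrong
student_cr_points = [2, 1]                     # partial credit on each item

student_raw = raw_score(student_responses, answer_key, student_cr_points)
print(student_raw)  # 3 correct + 3 points = 6
```

With 45 multiple-choice items and no constructed-response items, the same function can only return values from 0 to 45, which is the range mentioned above.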

For individual test items, we can look at the proportion of test-takers who answered the item correctly, which is referred to as the item difficulty or p-value. This has nothing to do with the p-values used in tests of statistical significance; the p here is simply the proportion of examinees who got the item right. Some test items are more difficult than others, and hence items will have varying p-values.
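
As a rough illustration of the item p-value, the sketch below computes the proportion of examinees answering each item correctly. The scored responses are invented, and the function name is mine, not a psychometric standard.

```python
def item_difficulty(scored_responses):
    """Return the p-value of each item: the proportion of examinees
    who answered it correctly.

    scored_responses: one list per examinee, with 1 for a correct
    answer and 0 for an incorrect one, in item order.
    """
    n_examinees = len(scored_responses)
    n_items = len(scored_responses[0])
    return [
        sum(examinee[i] for examinee in scored_responses) / n_examinees
        for i in range(n_items)
    ]


# Hypothetical data: four examinees, three items.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = item_difficulty(scored)
print(p_values)  # [1.0, 0.5, 0.25]
```

Note the slightly backwards naming: a higher p-value means more examinees got the item right, so the third item here, with a p-value of 0.25, is the hardest of the three.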

Raw scores are rarely interpretable, in part because they are a function of the difficulty of the items. For this reason, they are typically transformed into scale scores, which are designed to generate a score that will mean the same thing from one version of a test to the next, or from one year to the next. The scale for scale scores is arbitrary; the SAT is reported on a scale ranging from 200 to 800, whereas the NAEP scale ranges from 0 to 500.

The process of transforming raw scores into scale scores is computationally intensive, generally using a technique known as Item Resp