Through the lens of social science, eduwonkette takes a serious, if sometimes irreverent, look at some of the most contentious education policy debates. (Find eduwonkette's complete archives prior to Jan. 6, 2008 here.)

January 23, 2009

Wish #2: The End of Proficiency Only Accountability Systems

The No Child Left Behind Act may represent the largest threshold-based government accountability system in the country. Schools are evaluated not by how much progress students make, but by their success in pushing students over the proficiency bar. By now, you’re probably familiar with the discontents of this system: states can game the system by setting that proficiency bar low; some schools have triaged their students, essentially reallocating resources to the kids most likely to become proficient in the very short-term; and policymakers can misleadingly make claims about declining racial achievement gaps based on proficiency rates, even as these gaps are unchanged or growing.

Proficiency-based accountability systems leave us in a terrible spot. On the one hand, we want to push kids and raise the bar for proficiency. But on the other hand, we want to make sure that the lowest performing students aren’t kicked to the curb. The higher you raise that bar, the more likely you are to have a significant proportion of students in any given school below proficiency. And those are precisely the conditions under which it makes sense for educators to allocate their time and attention strategically.

All of this, of course, should have been expected in a system focused on proficiency rather than growth. And contrary to popular belief, NCLB's growth model pilot doesn't allow true value-added models, but is instead based on a “projection model” which requires all students to reach a fixed proficiency target regardless of their initial achievement levels.

What am I suggesting? The new Department of Education would do well to let states experiment with a few different accountability systems: 1) dump proficiency altogether and identify schools as in need of improvement based on whether they are making less growth than expected (in other words, drop NCLB’s arbitrary targets and evaluate schools based on how they are doing compared to the schools we already have), or 2) keep proficiency around, but focus improvement efforts on schools that are both low-growth and low-proficiency, judged not against an arbitrary standard but relative to other schools: perhaps those in the bottom 15% of both categories. (That number should be set based on the number of schools to which states can provide targeted support.)
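Option 2 can be made concrete with a small sketch. Everything in it is an assumption for illustration: the 20 schools and their scores are made up, the field names `growth` and `proficiency` and both helper functions are hypothetical, and 15% is only the placeholder figure suggested above.

```python
def bottom_group(schools, key, percent):
    """Names of the schools in the lowest `percent` percent on `key`."""
    k = max(1, len(schools) * percent // 100)  # size of the bottom group
    ranked = sorted(schools, key=lambda s: s[key])
    return {s["name"] for s in ranked[:k]}

def flag_for_support(schools, percent=15):
    """Schools that fall in the bottom group on BOTH growth and proficiency."""
    low_growth = bottom_group(schools, "growth", percent)
    low_proficiency = bottom_group(schools, "proficiency", percent)
    return sorted(low_growth & low_proficiency)

# Made-up scores for 20 hypothetical schools (not real data)
schools = [
    {"name": f"S{i:02d}", "growth": (7 * i) % 20, "proficiency": (3 * i) % 20}
    for i in range(20)
]

flagged = flag_for_support(schools)
print(flagged)
```

Because a school has to be low on both measures, the flagged list is usually much shorter than either bottom group on its own, which fits the idea of matching the list to the support a state can actually provide.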

Either of those options would require significant new investments in better tests that are designed to measure growth, and careful attention to building a value-added model that is both valid and reliable. New Yorkers know well that a poorly designed value-added model at the center of the Progress Reports wreaks more havoc than no value-added model at all.

My recommendations will surely fail to impress the “no excuses” crowd (or more aptly, the “nuke the system” crowd, my belated entry into Elizabeth Green’s name-the-reformer contest) who see anything short of “100% proficiency” as not radical enough. “No excuses” is great rhetoric, but in the end it’s just that. So my wish #2 is that we move past this bravado in the next four years and develop a more reasonable and effective way of identifying low-performing schools and supporting them in getting better.

PS: Check out Richard Rothstein's related op-ed, Getting Accountability Right, which speaks back to Wish #4 (integrating a broad set of goals of public schooling into accountability systems).

December 17, 2008

NYC's Trojan Horse

skoolboy has absolutely nothing of substance to say about Education Secretary nominee Arne Duncan, whom he has met exactly once. But he continues to mouth off about New York City's Teacher Data Reports, the NYC Department of Education's version of value-added assessment. Which are not to be used to evaluate teacher performance. But rather for instructional improvement. Excuse me, skoolboy has something in his eye.

It's hard not to view these Teacher Data Reports as a Trojan Horse. Just how is a tool that is designed for capacity-sorting supposed to function for capacity-building? After all, a teacher value-added measure might tell us something useful about which teachers are more or less successful in raising their students' test scores, but it tells us nothing about the specific instructional practices that account for their relative success.

How are Teacher Data Reports supposed to improve instruction? In her videotaped comments to teachers, Amy McIntosh, the Chief Talent Officer at NYC's Department of Education, says, "These reports will provide information that will help teachers and school leaders gain insights about important aspects of a teacher's practice ... Whether individual teachers have a greater influence on the learning of some groups of students than on others ... Finally, we can see what teachers might benefit from development focused on, say, the needs of English language learners, and which teachers might be best positioned to lead that kind of professional development ... We also think they will ... help you think about how you can share the techniques you use with your colleagues in your school or across the city."

Hmm. So the specific strategies for improving teaching practice are what, exactly? Having more successful teachers lead the professional development of less successful teachers? Expert practitioners don't always make expert coaches. Hall-of-Fame pro basketball player Isiah Thomas--unquestioned as one of the best point guards of all time--was a mediocre coach for the Indiana Pacers and New York Knicks.

Here's why. Teaching is an extraordinarily complex activity, with teachers making thousands of decisions in the course of their work. Successful teachers make many good decisions and some bad decisions, whereas less successful teachers make many bad decisions and some good decisions. But the capacity to reflect on one's practice and figure out which of those decisions are good and which are bad is exceedingly rare, as is the capacity to share this knowledge with others. In the absence of this reflective capacity, we're all prone to attribute our successes and failures to our pet theories, which may or may not be correct. A Teacher Data Report that provides reassurance that a teacher is successful will only solidify and reinforce a personal folk theory about the reasons for that success.

Yet the Teacher Data Report provides no evidence whatsoever about why a teacher is successful--the many daily practices that promote student learning. And if a teacher's personal theory is inaccurate, then sharing it with others will not improve instruction, nor student achievement. It could even make things worse, focusing attention on ineffective practices. A tool like the Teacher Data Report that claims to be useful for increasing teachers' capacity to teach students effectively, but instead is only useful for ranking teachers on their effectiveness, is a modern-day Trojan Horse.

December 15, 2008

Don't Think about Elephants

"Don’t think about elephants," skoolboy’s father used to joke, long before George Lakoff’s manifesto with a similar name. The joke, of course, is that by trying not to think about elephants, all that you can think about is elephants. The harder I tried not to think about elephants, the more I thought about them.

The New York City Department of Education has its own variation. This month, the DOE is sending Teacher Data Reports, which purport to estimate the effect of individual teachers in grades 4-8 on students’ test scores, to school principals, who will then distribute the reports to their teachers after the principals have been trained. "The Teacher Data Reports are not to be used for evaluation purposes," wrote Chancellor Joel Klein and UFT President Randi Weingarten in an October letter to teachers. "That is, they won’t be used in tenure determinations or the annual rating process. Administrators will be specifically directed accordingly." Similarly, the Frequently Asked Questions section of the DOE’s Teacher Data Toolkit website poses the question, "How can you be sure that principals won’t use the Teacher Data Reports to evaluate teachers?" The response: "Principals have been and will continue to be explicitly instructed not to use Teacher Data Reports to evaluate their teachers. The DOE has standard processes in schools for teachers to raise issues or concerns."

And yet. From the Frequently Asked Questions on the DOE’s Teacher Data Toolkit website: "By isolating individual teachers’ contributions to student progress, the Teacher Data Reports provide valuable information to school leaders and teachers about where to focus instructional improvement efforts. …Teacher Data Reports provide information about how individual teachers’ efforts influence student learning … A sophisticated multivariate regression analysis based on NYC data from 1999-2008 determined how much to weigh each factor [to calculate students’ predicted gains] … A panel of technical experts has approved the DOE’s value-added methodology. The DOE’s model has met recognized standards for demonstrating validity and reliability. Teachers’ value-added scores from the model are positively correlated with both School Progress Report scores and principals’ perceptions of teachers’ effectiveness, as measured by a research study conducted during the pilot of this initiative."

In other words: The Teacher Data Reports rely on sophisticated statistical techniques that are valid, reliable and approved by experts, and they isolate an individual teacher’s contributions to student learning. But, you principals who are under tremendous pressure to increase test scores or face losing your jobs, don’t you dare think about using these Teacher Data Reports to evaluate teachers.

Don’t think about elephants.

November 12, 2008

School Progress Grade Effects on NYC Achievement: Tame, Fierce, or a Hot Mess?

skoolboy ventured into the rarefied air of NYC’s Harvard Club yesterday to hear Marcus Winters present his new Manhattan Institute research on the effects of the 2006-07 New York City School Progress Reports on students’ 2008 performance on state math and English tests in grades four through eight. The analysis uses a regression-discontinuity design, capitalizing on the fact that schools received a continuous total score summarizing their performance on school environment (15%), student performance (30%) and student growth (55%), while firm cut-offs distinguish schools receiving an F from those receiving a D, those receiving a D from those receiving a C, etc. This means that there might be schools that are very similar in their total scores, and presumably on other school characteristics, on either side of a given cut-off, allowing researchers to study the test-score consequences of obtaining a specific letter grade.
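For readers who have not met a regression-discontinuity design, here is a minimal sketch of the idea on synthetic data. It is not Winters's model or data: the cut-off of 40, the built-in effect of 2 points, the bandwidth, and the function names are all assumptions chosen for illustration. The sketch fits a separate straight line to schools just below and just above the cut-off, then compares the two predictions at the cut-off itself.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def rd_estimate(schools, cutoff, bandwidth):
    """Estimated jump in the outcome for schools just BELOW the cutoff."""
    # Re-center the total score so that each line's intercept is its
    # predicted outcome exactly at the cutoff.
    below = [(s - cutoff, y) for s, y in schools if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in schools if cutoff <= s <= cutoff + bandwidth]
    at_cutoff_below, _ = fit_line(below)
    at_cutoff_above, _ = fit_line(above)
    return at_cutoff_below - at_cutoff_above

# Synthetic schools (not real data): outcomes rise smoothly with the total
# score, plus an assumed boost for schools that land below the cutoff.
rng = random.Random(0)
TRUE_EFFECT = 2.0
CUTOFF = 40.0
schools = []
for _ in range(4000):
    total_score = rng.uniform(0, 100)
    outcome = 50 + 0.3 * total_score + rng.gauss(0, 1)
    if total_score < CUTOFF:
        outcome += TRUE_EFFECT
    schools.append((total_score, outcome))

estimate = rd_estimate(schools, CUTOFF, bandwidth=10)
print(round(estimate, 2))
```

The estimate should land near the built-in effect, but only because the simulation has thousands of schools close to the cut-off. With a few dozen schools on either side, the same procedure gives a much noisier answer.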

The two tables below summarize the impact of the Progress Report grades on student math and English proficiency, respectively. Both tables contrast the consequences of getting an A, B, D or F with a reference category, a C grade. A green up-arrow indicates that students in a school that received a particular Progress Report Grade did better than students in C schools, whereas a red down-arrow indicates that students did worse than students in C schools. An X indicates that student performance did not differ significantly from that of students in C schools at the p<.05 level.

[Table image: Winters-Math.jpg (math proficiency)]

[Table image: Winters-ELA.jpg (English proficiency)]

There’s a lot of X’s. In math, students in F schools did better than students in schools receiving higher grades, although this seems to be primarily due to an effect in grade 5. Students in D schools also did better than those in schools receiving higher grades, also due to their advantages in grade 5, apparently. In English, the letter grade a school received did not have any consequences for student performance.

Although Winters and discussant Jonah Rockoff were careful to note the limits of the analyses and of what they can tell us about the incentive effects of accountability systems, both characterized the results as pretty clear evidence that schools reacted to receiving an F or a D in ways that boosted student achievement. This was particularly noteworthy, they argued, because so little time had elapsed between when a school learned that it had received a D or F and when students were tested: January for English, and March for mathematics.

Well, yeah, the short time between receiving the grade and the testing is certainly an issue, and surfaced as the likely explanation for why no effects of the School Progress Report grades were found in English. But skoolboy is still worried about math. There were no statistically reliable consequences for getting a D or an F in grades 4, 6, 7 and 8; only in grade 5 is there a test-score boost. How are we to make sense of this? If the letter grades are such a powerful incentive, wouldn’t they affect the performance of students in all of the grades in a school, not just fifth-graders?

Cool person Amy Ellen Schwartz posed a very smart question from the audience. "What about those A and B schools doing worse than the C schools in 5th grade math? What does that mean?" she asked. The panelists didn’t want to address that head-on, in skoolboy’s view, but he will: Looking at 5th grade mathematics, there’s as much evidence of the receipt of an A or a B causing a school to coast as there is evidence of the receipt of a D or an F causing a school to be more productive. Probably not a popular interpretation among the true believers in the power of incentives in the room.

But the bigger story is one of what Winters called "tame" effects. No effects of the School Progress Report grades in English, and limited evidence of effects in Math. A short time-horizon between the “treatment” of receiving the grades and student testing. Ambiguous incentives, both positive and negative, associated with the grades. A very weak theory of how the grades would be expected to increase student performance. It’s a wonder that Winters found anything at all.

A last point: Winters suggested that there were dire predictions that schools would "give up" if they got low Progress Report grades, and his findings, he said, did not show that. Although there were editorials at the time of the initial release of the Progress Reports last fall expressing concern that schools might be stigmatized by getting a C, D or F when students were performing at generally high levels, I question whether anyone thought that schools, and the educators who work in them, would "give up." The more predictable reaction, which I think was borne out, was that principals, teachers and parents would simply not believe the Progress Report grades accurately characterized what they saw on a day-to-day basis. A lot of stakeholders don’t believe that the Progress Report grades are reliable measures of school performance, and given what eduwonkette and I have shown about the instability in the student progress measures at the heart of the system, those beliefs are well-founded.

A brief version of the research can be found here. The technical version is now available at the same location.

October 1, 2008

Why skoolboy Is Uncertain about the NYC School Progress Reports

It’s election season, which means that we’re being inundated with polls. The reporting of poll results drives statisticians nuts, because the press often reports the percentage of those surveyed who favor one candidate or another without taking into account the poll’s margin of error. The margin of error is a way of quantifying the uncertainty in the poll numbers: even a well-designed poll that surveys a random and representative sample of the population can only generate an estimate of the true proportion of those in the population who favor a particular candidate. The general rule of thumb is that the more information available in a sample, the less uncertainty in the estimate. A smaller batch of information will yield a more uncertain, or imprecise, estimate than a larger batch of information. This is as true for estimates of the relative performance of schools and teachers (whether in the form of a complex value-added assessment model or a simple percentage) as it is for political polls.
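The rule of thumb can be put in numbers. The sketch below uses the usual textbook approximation for the 95% margin of error of a proportion from a simple random sample. That formula and the function name `margin_of_error` are my additions for illustration, not something taken from any particular poll.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p based on n responses.

    Assumes a simple random sample and the normal approximation.
    """
    return z * math.sqrt(p * (1 - p) / n)

# A candidate polling at 50%, with samples of increasing size
for n in (100, 400, 2500):
    moe_points = 100 * margin_of_error(0.5, n)
    print(f"n = {n}: plus or minus {moe_points:.1f} percentage points")
```

Quadrupling the sample from 100 to 400 people cuts the margin of error in half, from about 9.8 to about 4.9 percentage points. The same arithmetic applies to a percentage computed from the students in a single school, where the number of students is often closer to the small end of that range.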

With apologies to anyone who’s had an introductory statistics course, suppose that we were trying to estimate the average age of the teachers in a very small school (one with only four teachers), but we can only draw a sample of three of the teachers to estimate that average. The four teachers are 25, 30, 30, and 55 years old, and the true average age is (25+30+30+55)/4=35. If our sample was the teachers who are 25, 30 and 30, our estimate of the average age of teachers in the school would be (25+30+30)/3=28.33. If our sample was the teachers who are 30, 30 and 55, our estimate of the average would be (30+30+55)/3=38.33. It’s a simple example, but it shows that different samples drawn from a given population can produce quite different estimates, which can be some distance away from the true population value. You wouldn’t want to place too much confidence in a particular estimate if you knew that another, equally valid sample of the same size could generate an estimate that was quite different.
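The four-teacher example is small enough to enumerate completely. This short sketch lists every possible sample of three teachers and the average age each one produces; the variable names are just for illustration.

```python
from itertools import combinations

ages = [25, 30, 30, 55]
true_mean = sum(ages) / len(ages)

# Every possible sample of three teachers, and its estimate of the mean age
samples = list(combinations(ages, 3))
estimates = [sum(sample) / len(sample) for sample in samples]

for sample, estimate in zip(samples, estimates):
    print(sample, round(estimate, 2), "off by", round(estimate - true_mean, 2))
```

The four possible samples give estimates of 28.33, 36.67, 36.67 and 38.33, so none of them hits the true average of 35, even though the estimates average out to exactly 35 across all four samples.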

That same logic applies to estimates of school and teacher performance, such as the New York City School Progress Reports. Most of the elements of the Progress Reports are estimates (for an explanation why, see here), but the calculation of the overall letter grades, which receive so much attention, does not take the uncertainty in these estimates into account. Today, I’ll show that using the 2008 School Progress Reports.

One of the indicators of student progress on the School Progress Reports is the percentage of students who made a year’s worth of progress in English (ELA) and in math from 2007 to 2008. In a given school, each child who was tested in both years can be classified as having made a year’s worth of progress or not, and by totaling up those students who made a year’s worth of progress and dividing by the number of students who were tested in both years, a percentage can be calculated. (There’s an additional wrinkle for students who transferred from one school to another, but it doesn’t affect the logic I’m writing about.)

Each school is compared to a group of 40 peer schools that are judged to be similar based on their demographic and other characteristics. A school’s percentage of children making a year’s progress in ELA is compared to the highest and lowest values in its peer group, and the school gets a peer horizon score that represents its location between the high and low peer group values. For example, if a school had 55% of its students make a year’s progress in ELA, and the percentage for the lowest school in its peer group was 47%, and the percentage for the highest school in its peer group was 71%, the school was located one-third of the way between the lowest and highest schools (8 percentage points above the minimum, out of a possible 24 percentage points above the minimum in the peer group.) That peer horizon score of .33 would be multiplied by the 5.625 points that this component is counted in the calculation of the overall letter grade of the school, yielding a net contribution of 1.875 to the school’s overall score.
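The peer horizon arithmetic in that example can be written out in a few lines. The function name and the constant are mine, introduced only to mirror the description above. The sketch covers just this one component, not the DOE's full scoring rules.

```python
def peer_horizon_score(school_pct, peer_min, peer_max):
    """Position of a school between the lowest and highest peer values (0 to 1)."""
    return (school_pct - peer_min) / (peer_max - peer_min)

# Points this component can contribute to the overall score (from the text)
COMPONENT_POINTS = 5.625

score = peer_horizon_score(55, 47, 71)   # 8 points above the minimum, out of 24
points = score * COMPONENT_POINTS

print(round(score, 3), round(points, 3))
```

Note that the score depends entirely on three percentages: the school's own and the two most extreme values in its peer group. If any of them moves, the school's points move with it.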

The problem is that this calculation doesn’t take into account the fact that all of these percentages are estimates. The chart below looks at one elementary school in particular, Senator John Calandra School (08X014), and compares it to its peer group of 40 schools. At Calandra, 58.3% of the students made a year’s worth of progress in English in 2008. But the standard error of that percentage is 3.5%, which means that Calandra’s true percentage could plausibly be anywhere from 51.3% to 65.3% (two standard errors on either side of the estimate), a wide range. (This range is shown in the “error bars” above and below the estimated percentage for each school.) The same is true for most of the other schools in the peer group. In fact, only two of the 40 schools in the peer group (the ones with the blue markers in the chart) have a percentage that we are confident is higher than Calandra’s percentage. For the other 38 schools in the peer group, we can’t rule out the possibility that Calandra’s percentage is equal to the estimated percentage in those schools. There’s a tremendous amount of overlap among these schools.
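The range quoted for Calandra is the estimate plus or minus two standard errors, and the comparison with peer schools amounts to asking whether a peer's percentage falls inside that range. Here is a rough sketch of both steps. The two peer percentages are hypothetical, and reading the comparison as a simple "inside the range" check is my assumption about how the chart was built.

```python
def plausible_range(pct, standard_error, k=2):
    """Estimate plus or minus k standard errors."""
    return pct - k * standard_error, pct + k * standard_error

def cannot_rule_out(peer_pct, low, high):
    """True if a peer school's percentage lies inside the plausible range."""
    return low <= peer_pct <= high

# Calandra's figures from the text
low, high = plausible_range(58.3, 3.5)
print(round(low, 1), round(high, 1))

# Two hypothetical peer schools (made-up percentages)
for peer_pct in (64.0, 70.0):
    print(peer_pct, cannot_rule_out(peer_pct, low, high))
```

A peer school at 64% sits inside Calandra's range, so the two can't be told apart on this measure, while one at 70% sits outside it.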

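[Chart image: 08X014.JPG (percentage of students making a year's progress in ELA, with error bars, for Calandra and its peer schools)]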

And yet Calandra received a peer horizon score of .463, and other schools in the peer group whose percentages of students making a year’s worth of progress in English did not differ statistically from Calandra received peer horizon scores ranging from .169 to .903. Calandra’s peer horizon score of .463 counted for 2.6 out of a possible 5.625 points toward the overall score on the School Progress Report. Other peer schools whose percentages did not differ significantly from Calandra’s received from 1.0 to 5.1 points out of a possible 5.625 points on this component of the overall score. Differences of this magnitude could easily make the difference between an overall grade of A and of B, or of B and of C—just due to chance. An accountability system such as the New York City School Progress Reports that doesn’t acknowledge the importance of chance and uncertainty is fundamentally misleading the public about its ability to distinguish the relative performance of schools. Some schools are likely doing significantly better than other schools; the problem is that the School Progress Reports don't provide enough information to judge which ones.

September 30, 2008

No Child Left Behind: Looking Back, Looking Forward

I'm knee deep in old NCLB documents, and ran across the Department of Education's NCLB song. NCLB represented not only a major shift in federal education policy, but an embrace of policy/PR boosterism that's enough to make all of us giggle (Remember Armstrong Williams?). Back from 2002, here are the NCLB lyrics:

We're here to thank our president,
For signing this great bill,
That's right! Yeah,
Research shows we know the way,
It's time we showed the will!

No matter how catchy the ditty, a song can't carry a fundamentally flawed law. That's where Tom Toch and Doug Harris come in. They've penned a thoughtful commentary in this week's Ed Week about the future of NCLB (Salvaging Accountability). It's an important one, because it recognizes that NCLB conflates the school's contribution to student learning with what students bring to the school to begin with. Essentially the argument is that:

1) "It’s critical in any accountability system that the metrics used to judge performance reflect accurately the contributions of those being judged."

2) "As a measure of school performance, however, [the NCLB] snapshot strategy is flawed. Because student populations vary greatly from school to school, and because family income, parental education, and a host of other non-school-related factors have a major influence on students’ learning, some schools have to improve student achievement a lot more than others to get their students up to state standards. The federal law is unforgiving of such schools. As a result, it gives an unfair advantage to schools with students from privileged backgrounds, and it fails to measure what matters most: how much students learn during the school year."

3) The Department of Education's Growth Model Pilot offers little improvement over the current rating system because it relies on a projection model (i.e., are students on target to be proficient within a 3-year window?) rather than a true growth model.

4) The new NCLB should dump the projection model, and focus its sanctions on schools that are both low in terms of their growth, and low in terms of their proficiency. And there's no reason to wait for reauthorization - this could all happen via regulations.

No commentary can do it all, so here are some issues to ponder for their next round. The goal of Toch and Harris' proposed system is to make measurement of school performance a more fair and effective enterprise. Why not take the leap and dump 100% proficiency altogether? That way, we could narrowly tailor our sanctions to schools that are low-performing compared to the schools we already have.

And if we're going to go full throttle on value-added models, we can't just punt the measurement problems. For example, Toch and Harris write, "value-added calculations have larger margins of error than NCLB’s proficiency ratings, but because they measure what’s most important in judging schools—student learning gains—their statistical shortcomings are more than worth tolerating."

A poorly designed growth model is no better than the poorly designed proficiency model that we have now, and no one knows this better than New Yorkers. Value-added systems that have literally no relationship between two years' value-added measures are still bad public policy. In short, beware the silver bullet.

September 24, 2008

Could a Monkey Do a Better Job of Predicting Which Schools Show Student Progress in English Skills than the New York City Department of Education?

eduwonkette and I have been blogging about the School Progress Reports released last week by the New York City Department of Education. We’ve shown that, although the performance and environment scores of schools were pretty consistent from last year to this year, the student progress scores were virtually unrelated—knowing a school’s progress score from last year didn’t predict which schools would demonstrate a lot of progress this year. This, we argued, demonstrated that the progress part of the School Progress Report—representing 60% of the letter grade each school received—wasn’t really telling us which schools consistently are promoting student progress, but rather was mostly random error.

The problem was particularly acute in the domain of English Language Arts (ELA). The stability in the student progress scores from 2007 to 2008 was so low that it led skoolboy to wonder if a monkey could actually do a better job predicting which schools show progress in students’ ELA performance in 2008 than relying on the DOE’s 2007 student progress score. The particular measure I examined was the percentage of students in the school making at least one year of progress on the ELA test from last year to this year. (As we've noted in earlier posts, the calculation of this measure changed slightly from 2007 to 2008.)

In the interest of full disclosure, skoolboy didn’t actually rent a monkey to pick the schools. Animals scare him, and he wouldn’t have been able to record the picks while hiding under his bed. What I did instead was use a random number generator to assign each school to the top or bottom half of the distribution of schools on last year’s peer and citywide measures of the percentage of students making a year of progress in English Language Arts.

The DOE got credit for a correct prediction if it correctly predicted that a school would be in the top half of this year’s schools, based on the school being in the top half on the DOE’s 2007 measure, or correctly predicted that a school would be in the bottom half of this year’s schools, based on the school being in the bottom half last year. The monkey got credit for a correct prediction if the randomly-selected location of a school as being in the top half of the 2007 distribution correctly predicted that a school would be in the top half of this year’s schools, or the random pick of being in the bottom half of last year’s distribution correctly predicted that a school would be in the bottom half of this year’s schools. These predictions were done separately for the 570 elementary schools, 128 K-8 schools, and 289 middle schools which received overall letter grades last year and this year.
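The scoring rule is easy to simulate. The sketch below uses synthetic scores that are, by construction, unrelated from one year to the next. It doesn't use the actual Progress Report data, and the names and the seed are arbitrary. It simply shows what both contestants' hit rates look like when last year's score carries no information at all.

```python
import random
from statistics import median

def top_half(values):
    """True for each value above the median, False otherwise."""
    cut = median(values)
    return [v > cut for v in values]

def hit_rate(predicted, actual):
    """Share of schools whose predicted half matches their actual half."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

rng = random.Random(2008)
n_schools = 570  # the number of elementary schools in the post

# Synthetic progress scores, unrelated across the two years (an assumption)
last_year = [rng.gauss(0, 1) for _ in range(n_schools)]
this_year = [rng.gauss(0, 1) for _ in range(n_schools)]

actual = top_half(this_year)
doe_pick = top_half(last_year)                                 # trust last year's score
monkey_pick = [rng.random() < 0.5 for _ in range(n_schools)]   # coin flip

doe_rate = hit_rate(doe_pick, actual)
monkey_rate = hit_rate(monkey_pick, actual)
print(round(doe_rate, 2), round(monkey_rate, 2))
```

When the two years are unrelated, both hit rates hover around 50%, and which one comes out ahead in any given round is a matter of luck.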

Round 1. We begin with the peer horizon score for the 570 elementary schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 46% of the elementary schools this year. The monkey correctly predicted the status of 51% of this year’s schools.

Score: Monkey 1, DOE 0.

Round 2. We next turn to the citywide horizon score for the 570 elementary schools. The DOE’s citywide horizon progress score from last year correctly predicted the progress status of 47% of the elementary schools this year. The monkey correctly predicted the status of 52% of this year’s schools.

Score: Monkey 2, DOE 0.

Round 3. In this round, we examine the peer horizon scores for the 128 K-8 schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 45% of the K-8 schools this year. The monkey correctly predicted the status of 55% of this year’s schools.

Score: Monkey 3, DOE 0.

Round 4. Next, we look at the citywide horizon progress scores for the 128 K-8 schools. The DOE’s citywide horizon progress score from last year correctly predicted the progress status of 43% of the K-8 schools this year. The monkey correctly predicted the status of 47% of this year’s schools.

Score: Monkey 4, DOE 0.

Round 5. The final stage of the competition examines the 289 middle schools. The DOE’s peer horizon progress score from last year correctly predicted the progress status of 40% of the middle schools this year. The monkey correctly predicted the status of 50% of this year’s middle schools.

Score: Monkey 5, DOE 0.

Round 6. The last round looks at the citywide horizon progress scores for the middle schools. The DOE’s citywide horizon progress scores from last year correctly predicted the progress status of 45% of this year’s middle schools. The monkey correctly predicted the status of 49% of this year’s middle schools.

Score: Monkey 6, DOE 0.

skoolboy will forgo the cheap jokes about how a monkey could do a better job of managing New York City’s accountability system than the people currently in charge. On the whole, they’re smart, hard-working people, and ridiculing them is not likely to persuade them to change their behavior (as satisfying as it may be at particular moments). But the system that they have designed and implemented is profoundly flawed, as this comical example illustrates, and it needs to change. eduwonkette and I are going to keep hammering on this point, because it has such important consequences for students and for schools.

And besides: I bet the DOE would beat the monkey in predicting school progress scores in math. (But it wouldn’t be a rout.)

September 23, 2008

What Does Educational Testing Really Tell Us? An Interview with Daniel Koretz

Daniel Koretz, a professor who teaches educational measurement at the Harvard Graduate School of Education, generously agreed to field a few questions about educational testing. He is the author of Measuring Up: What Educational Testing Really Tells Us.

EW: What are the three most common misconceptions about educational testing that Measuring Up hopes to debunk?

DK: There are so many that it is hard to choose, but given the importance of NCLB and other test-based accountability systems, I'd choose these:
* That test scores alone are sufficient to evaluate a teacher, a school, or an educational program.

* That you can trust the often very large gains in scores we are seeing on tests used to hold students accountable.

* That alignment is a cure-all - that more alignment is always better, and that alignment is enough to take care of problems like inflated scores.

EW: I'm intrigued by your third point about alignment. For example, we often hear that because state testing systems are directed towards a particular set of standards, we should primarily be concerned with student outcomes on tests aligned with those standards. This is the common refrain about a "test worth teaching to." What's missing from this argument?

DK: Up to a point, alignment is a clearly good thing: we want clarity about goals, and we want both instruction and assessment to focus on the goals deemed most important.

However, there are two flies in the ointment. The first is that the achievement tests we are concerned with, no matter how well aligned, are small samples from large domains of performance. That means that most of the domain, including much of the content and skills relevant to the standards, is necessarily omitted from the test. As I explain in Measuring Up, this is analogous to a political poll or any other survey, and it is not a big problem under low-stakes conditions. Under high-stakes conditions, however, there is a strong incentive to focus on the sampled content at the expense of the omitted material, which causes score inflation. Aligned tests are not exempt. Score inflation does not require that the test include poorly aligned content. Even if the test is right on target, inflation will occur if the accountability program leads people to deemphasize other material that is also important for the conclusions based on scores. And to make this concrete: some of the most serious examples of score inflation in the research literature were found in Kentucky's KIRIS system, which was a standards-based testing program.

The second problem is predictability. To prepare students in a way that inflates scores, you have to know something about the test that is coming this year, not just the ones you have seen in the past. The content, format, style, or scoring of the test has to be somewhat predictable. And, of course, it usually is, as anyone who has looked at tests and test preparation materials should know. Carried too far, alignment actually makes this problem worse, by focusing attention on the particular way that knowledge and skills are presented in a given set of standards. Think about 'power standards,' 'eligible standards,' and 'grade level expectations,' all of which can be labels for narrowing in on the specifics of how a set of skills appear on one state's particular assessment.

Why is this bad? Because many of those specifics are not relevant to the students' broader competence and long-term well-being. Scores on a test are a means to an end, not properly an end in themselves. Education should provide students knowledge and skills that they can use in later study and in the real world. Employers and university faculty will not do students the favor of recasting problems to align with the details of the state tests with which they are familiar. As Audrey Qualls said some years ago: real gains in achievement require that students can perform well when confronted with "unfamiliar particulars." Improving performance on the familiar but not the unfamiliar is score inflation.

EW: What are the implications of score inflation for both measuring and attenuating achievement gaps? Because schools serving disadvantaged students face more pressure to increase test scores via the mechanisms you describe, I worry that true achievement gaps may be unchanged - or even growing - while they appear to be closing based on high-stakes measures.

DK: I share your worry. I have long suspected that on average, inflation will be more severe in low-achieving schools, including those serving disadvantaged students. In most systems, including NCLB, these schools have to make the most rapid gains, but they also face unusually serious barriers to doing so. And in some cases, the size of the gains they are required to make exceeds by quite a margin what we know how to produce by legitimate means. This will increase the incentive to take short cuts, including those that will inflate scores. This would be ironic, given that one of the primary rationales for NCLB is to improve equity. Unfortunately, while we have a lot of anecdotal evidence suggesting that this is the case, we have very few serious empirical studies of this. We do have some, such as the RAND study that showed convincingly that the "Texas miracle" in the early 1990s, supposedly including a rapid narrowing of the achievement gap, was largely an illusion. Two of my students are currently working with me on a study of this in one large district, but we are months away from releasing a reviewed paper, and it is only one district.

I have argued for years that one of the most glaring faults of our current educational accountability systems is that we do not sufficiently evaluate their effects, instead trusting - evidence to the contrary - that any increase in scores is enough to let us declare success. We should be doing more evaluation not only because it is needed for the improvement of policy, but also because we have an ethical obligation to the children upon whom we are experimenting. Nowhere is this failure more important than in the case of disadvantaged students, who most need the help of education reform.

Inflation is not the only reason why we are not getting a clear picture of changes in the achievement gap. The other is our insistence on standards-based reporting. As I explain in Measuring Up, relying so much on this form of reporting has been a serious mistake for a number of reasons. One reason is that if one wants to compare change in two groups that start out at different levels - poor and wealthy kids, African American and white kids, whatever - changes in the percents above a standard will always give you the wrong answer. This particular statistic confuses the amount of progress a group makes with the proportion of the group clustered around that particular standard, and the latter has to be different for high- and low-scoring groups. I and others have shown that this distortion is a mathematical certainty, but perhaps most telling is a paper by Bob Linn that shows that if you ask whether the achievement gap has been closing, NAEP will give you different answers - very different answers - depending on whether you use changes in scale scores, changes in percent above Basic, or changes in percent above Proficient. This is not because the relative progress has been different at different levels of performance; it is simply an artifact of using percents above standards. This is only one of many problems with standards-based reporting, but in my opinion, it is by itself sufficient reason to return to other forms of reporting.
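To make the percent-above-a-standard point concrete, here is a minimal sketch. It assumes two hypothetical groups whose scores are normally distributed with the same spread, one starting half a standard deviation below average and one half a standard deviation above, and it gives both groups exactly the same gain. The group means, the size of the gain, and the two cutoffs are illustrative assumptions, not figures from NAEP or from Linn's paper.

```python
from statistics import NormalDist

def pct_above(mean, sd, cutoff):
    """Percent of a normally distributed group scoring above the cutoff."""
    return 100 * (1 - NormalDist(mean, sd).cdf(cutoff))

def gap_change(cutoff, low_mean=-0.5, high_mean=0.5, sd=1.0, gain=0.25):
    """Change in the percent-above-cutoff gap when BOTH groups gain the same amount.

    The gap in mean scale scores is unchanged by construction, so any
    nonzero result is an artifact of reporting percents above a standard.
    """
    gap_before = pct_above(high_mean, sd, cutoff) - pct_above(low_mean, sd, cutoff)
    gap_after = (pct_above(high_mean + gain, sd, cutoff)
                 - pct_above(low_mean + gain, sd, cutoff))
    return gap_after - gap_before

# Same data, same equal gains, two different (hypothetical) standards.
for label, cutoff in [("low standard", -1.0), ("high standard", 1.0)]:
    print(f"{label}: change in gap = {gap_change(cutoff):+.1f} percentage points")
```

Under these assumptions the gap appears to narrow when the standard is set low and to widen when it is set high, even though the two groups made identical progress on the underlying scale.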

September 17, 2008

Between a Political Rock and a Statistical Hard Place

Some days, skoolboy feels bad for the hard-working folks in the New York City Department of Education. They’re caught between a political rock and a statistical hard place. The political rock is the New York State accountability system, which complies with No Child Left Behind’s requirements to test students annually in grades 3-8 in Mathematics and English Language Arts, and to classify students, based on their test scores, as either Not Meeting Learning Standards (Level I), Partially Meeting Learning Standards (Level II), Meeting Learning Standards (Level III), or Meeting Learning Standards with Distinction (Level IV), and then aggregate the performance of students, and subgroups of students, to assess the school’s progress toward the goal of 100% proficiency for all students by the year 2014. The mechanism for this is a series of grade-specific exams, with a broad (but arbitrary, as Dan Koretz explains in Measuring Up) standard-setting process that defines the scores on the exam that correspond to the four proficiency levels. Whatever a student’s scale score on the exam, he or she is classified into a particular proficiency level.

The statistical hard place is that the proficiency levels are only part of the story. The NYC DOE has found that the scale scores matter, such that a student whose scale score is halfway between the cutoffs for Level II and Level III, and therefore whose proficiency level is Level II, has a higher probability of graduating from high school on time than a student whose scale score is right at the cutoff for Level II. The scale scores have predictive validity—that is, they predict educational outcomes that we think of as important—but they don’t have the political currency of the proficiency levels specified by the state and the federal government.

There’s no evidence, to skoolboy’s knowledge, that achieving a proficiency level on NCLB-style exams has any predictive validity over and above the scale scores on which the levels are based. (Another regression discontinuity design study waiting to happen.) But I’ll wager that it doesn’t.

Whether or not the state/NCLB proficiency levels matter, the NYC DOE is stuck. They have to pay homage to the state standards, even though their internal evidence shows that partial progress—“learning quite a bit,” in skoolboy’s terms—really does matter for students’ futures, and therefore is something that schools should be held accountable for.

And I don’t disagree. I would be comfortable (though not ecstatic) with school progress reports that used changes in scale scores to quantify how much students had learned from one year to the next, under two conditions: (a) if the exams were vertically linked, and (b) if the uncertainty in the estimates of school-level effects on the average change were taken into account. Neither of these conditions is met in the current New York City School Progress Reports.

Navigating the political rock and the statistical hard place is definitely a challenge, both rhetorically and in the construction of the School Progress Reports. Rhetorically, the DOE is obliged to argue that a student who is Level III in fourth grade and Level II in fifth grade has lost ground—that student has fallen off of the sharp Level III cliff—because the state and federal accountability metrics treat this as a sharp discontinuity. But as a practical matter, the student may not have fallen off a cliff; rather, she may be just a little bit lower on a gradual hill in fifth grade than we’d like, but still higher on the hill than she was in fourth grade--and the DOE’s internal analyses document that anyone who is higher on the hill is better off than someone lower.

What’s the DOE to do? Well, it could continue to escalate the rhetoric directed toward its critics. (I note with alarm that the DOE went from calling me by my blogging name “skoolboy” on Monday to calling me, on Wednesday, “Professor Pallas of Teachers College,” whose proclivity for giving A’s to all of his students will come as a surprise to many of them. What’s next? Examining my teeth?) Or it could speak honestly and openly about the challenge of incorporating political and technical realities into the School Progress Reports. I think readers know which path skoolboy recommends.

Guest Blogger Daniel Koretz on New York City's Progress Reports

Daniel Koretz is a professor who teaches educational measurement at the Harvard Graduate School of Education. He is the author of Measuring Up: What Educational Testing Really Tells Us. Below, he weighs in on the NYC Progress Reports that were released yesterday.

eduwonkette: One of the key points of your book is that test scores alone are insufficient to evaluate a teacher, a school, or an educational program. Yesterday, the New York City Department of Education released its Progress Reports, which grade each school on an A-F scale. 60 percent of the grade is based on year-to-year growth and 25 percent is based on proficiency, so 85 percent of the grade is based on test scores. Do you have any advice to New Yorkers about how to use - or not to use - this information to make sense of how their schools are doing?

Koretz: This is a more complicated question in New York City than in many places because of the complexity of the Progress Reports. So let’s break this into two parts: first, what should people make of scores, including the scores New York released a few weeks ago, and second, what additional considerations should New Yorkers keep in mind in interpreting the Progress Reports?

In the ideal world, where tests are used appropriately, I give parents and others the same warning that people in the testing field have been offering (to little avail) for more than half a century: test scores give you a valuable but limited picture of how kids in a school perform. There are many important aspects of schooling that we do not measure with achievement tests, and even for the domains we do measure—say, mathematics—we test only part of what matters. And test scores only describe performance; they don’t explain it. Decades of research has repeatedly confirmed that many factors other than school quality, such as parental education, affect achievement and test scores. Therefore, schools can be either considerably better or considerably worse than their scores, taken alone, would suggest.

However, there is another complication: when educators are under intense pressure to raise scores, high scores and big increases in scores become suspect. Scores can become seriously inflated—that is, they can increase substantially more than actual student learning. This remains controversial in the education policy world, but it should not be, because the evidence is clear, and similar corruption of accountability measures has been found in a wide variety of different economic and policy areas (so widely that it goes by the name of “Campbell’s Law”). High scores or big gains can indicate either good news or inflation, and in the absence of other data, it is often not possible to distinguish one from the other. As you know, this was a big issue in New York City this year, in part because some of the gains, such as the increase in the proportion at Levels 3-4 in 8th grade math, were remarkably large.

New York City is a special case. It is always necessary to reduce the array of data from a test to some sort of indicators, and NYC has developed its own, called the Progress Reports, which assign schools one of five grades, A through F. My advice to New Yorkers is to pay attention to the information that goes into creating the Progress Reports but to ignore the letter grades and to push for improvements to the evaluation system.

The method for creating Progress Reports is baroque, and it is hard to pick which issues to highlight in a short space. The biggest problems, in my opinion, lie in the estimation of student progress, which constitutes 60% of the grade. The basic idea is that a student’s performance on this year’s test is compared to her performance in the previous grade, and the school gets credit for the change. It sounds simple and logical, but the devil is in the details. (For a non-technical overview of the issues in using value-added models to evaluate teachers and schools, see “A Measured Approach”.)

To keep this reasonably brief, I’ll focus on three problems. First, the tests are not appropriate for this purpose. skoolboy made reference to part of this problem in a posting on your blog. To be used this way, tests in adjacent grades should be constructed in specific ways, and the results have to be placed on a single scale (a process called vertical linking). Otherwise, one has no way of knowing whether, for example, a student who gets the same score in grades 4 and 5 improved, lost ground, or treaded water. The tests used in New York were not constructed for this purpose, and the scale that NYC has layered on top of the system for this purpose is not up to the task.

And that points to the second problem, which again skoolboy noted: the entire system hinges on the assumption that one unit of progress by student A means the same amount of improvement in learning as one unit by student B. This is what is called technically an interval scale, meaning that a given interval or difference means the same thing at any level. Temperature is an interval scale: the change from 40 to 50 degrees signifies the same increase in energy as the change from 150 to 160. There is no reason to believe that the scale used in the Progress Reports is even a reasonable approximation to an interval scale. It starts with the performance standards, which are themselves arbitrary divisions and cannot be assumed to be equal distances apart. The NYC system assigns to these standards new scores that nonetheless assume that the standards are equidistant—so, for example, a school gets the same credit for moving a student from Level 1 to Level 2 as for moving a student from Level 2 to Level 3. Moreover, the NYC system assumes that a student who maintains the same level on this scale has made “a year’s worth of progress.” That assumption is also unwarranted, because standards are set separately by grade, and there is no reason to believe that a given standard, say, Level 3, means a comparable level of performance in adjacent grades. (There is in fact some evidence to the contrary.)
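A small sketch can show why treating the performance levels as equidistant matters. The cut scores below are hypothetical, chosen only so that the levels are unequal in width; the rating function is an assumed stand-in for the kind of level-based scale described above (whole numbers at each cut, interpolated in between), not the DOE's actual formula.

```python
from bisect import bisect_right

# Hypothetical scale scores at which Levels 1-4 begin, plus an assumed top
# of the scale. The levels are deliberately unequal in width.
CUTS = [600, 640, 650, 700, 780]

def level_rating(scale_score):
    """Map a scale score to a 1.0-5.0 rating that treats every level as equally wide."""
    s = min(max(scale_score, CUTS[0]), CUTS[-1])    # clamp to the scale
    i = min(bisect_right(CUTS, s), len(CUTS) - 1)   # index of the cut just above s
    lower, upper = CUTS[i - 1], CUTS[i]
    return i + (s - lower) / (upper - lower)

# Two students each gain 10 scale-score points, starting from different places.
gain_in_narrow_level = level_rating(650) - level_rating(640)  # crosses all of Level 2
gain_in_wide_level = level_rating(660) - level_rating(650)    # a sliver of Level 3
print(round(gain_in_narrow_level, 2), round(gain_in_wide_level, 2))
```

With these made-up cuts, the first student is credited with a full level of progress and the second with a fifth of a level for the same 10-point change. Whether 10 scale-score points themselves mean the same thing at both places is a separate question that this sketch does not settle.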

The result is that there is no reason at all to trust that two equally effective schools, one serving higher achieving students than another, will get similar Progress Report grades. Moreover, even within a school, two students who are in fact making identical progress may seem quite different by the city’s measure. There may be reasons for policymakers to give more credit for progress with some students than for progress with others, but if one does that, one no longer has a straightforward, comparable measure of student progress.

And finally, there is the problem of error. People working on value-added models have warned for years that the results from a single year are highly error-prone, particularly for small groups. That seems to be exactly what the NYC results show: far more instability from one year to the next than could credibly reflect true changes in performance. Mayor Bloomberg was quoted in the New York Times on September 17 as saying, “Not a single school failed again. That’s exactly the reason to have grades…It’s working.” This optimistic interpretation does not seem warranted to me. The graph below shows the 2008 letter grades of all schools that received a grade of F in 2007. It strains credulity to believe that if these schools were really “failing” last year, three-fourths of them improved so markedly in a mere 12 months that they deserve grades of A or B. (The proportion of 2007 A schools that remained As was much higher, about 57 percent, but that was partly because grades overall increased sharply.) This instability is sampling error and measurement error at work. It does not make sense for parents to choose schools, or for policymakers to praise or berate schools, for a rating that is so strongly influenced by error.
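The role of error in this kind of churn can be illustrated with a toy simulation. It assumes a set of hypothetical schools whose true quality does not change at all between two years, adds independent noise to each year's measured progress, and then asks how many schools ranked in the bottom fifth in year one are still there in year two. The number of schools, the noise level, and the bottom-fifth rule are all assumptions made for illustration; none of this reproduces the actual Progress Report calculations.

```python
import random

def simulate_repeat_rate(n_schools=1000, true_sd=1.0, noise_sd=2.0, seed=0):
    """Share of year-1 bottom-quintile schools that are also bottom-quintile in year 2.

    True school quality is held fixed across both years, so every change in
    rank comes from the noise added to each year's measurement.
    """
    rng = random.Random(seed)
    true_quality = [rng.gauss(0, true_sd) for _ in range(n_schools)]

    def observed_year():
        return [q + rng.gauss(0, noise_sd) for q in true_quality]

    year1, year2 = observed_year(), observed_year()
    k = n_schools // 5  # size of the bottom fifth
    bottom1 = set(sorted(range(n_schools), key=year1.__getitem__)[:k])
    bottom2 = set(sorted(range(n_schools), key=year2.__getitem__)[:k])
    return len(bottom1 & bottom2) / k

print(f"noisy measure:      {simulate_repeat_rate():.0%} stay in the bottom fifth")
print(f"error-free measure: {simulate_repeat_rate(noise_sd=0.0):.0%} stay in the bottom fifth")
```

When the noise is large relative to real differences among schools, most of the lowest-ranked schools escape the bottom group the next year without any change in what they actually do.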

We should give NYC its due. The Progress Reports are commendable in two respects: considering non-test measures of school climate, and trying to focus on growth. Unfortunately, the former get very little weight, and the growth measures are not yet ready for prime time.

2008 Letter Grades of Schools that Received an F Grade in 2007

[Image: NYC F schools.png]

September 14, 2008

Let the Spin Begin

Suppose that your fourth-grader takes a state test that shows that she understands the associative property of multiplication, can multiply two-digit numbers by two-digit numbers, and can find the perimeter of a polygon by adding up the length of the sides. A year later, as a fifth-grader, she takes a test that shows that she can compare fractions and decimals using <, > or =; identify the factors of a given number; simplify fractions to their lowest terms; and knows that the sum of the interior angles of a quadrilateral is 360 degrees—but she cannot yet create algebraic or geometric patterns using concrete objects or visual drawings (e.g., rotate and shade geometric shapes). Would you say that your child had lost ground in proficiency, or actually gone backward?

Jim Liebman would. Liebman, the Columbia University law professor on leave as Chief Accountability Officer at the New York City Department of Education, is quoted and paraphrased in an article by Jim Dwyer in Saturday’s New York Times on the F grade that P.S. 8 in Brooklyn Heights will receive in this year’s School Progress Reports—a grade that many are finding hard to believe, given that 80% of the students tested in the school are judged proficient in math, and two-thirds are judged proficient in English Language Arts. Doubly embarrassing, in that Chancellor Joel Klein and Mayor Mike Bloomberg have publicly declared the school to be successful and worthy of emulation.

So the spinmeisters are out, and the spin here is justifying the grade of F by arguing that the children in P.S. 8 are going backward. “You drop them off at the beginning of the year, and on average, by the end of the year, your child lost ground in proficiency,” Dwyer quotes Liebman as saying. “Where was the child last year, and where is the child this year?” Liebman asked. “You’re comparing them to themselves.”

A gentle reminder to Mr. Liebman, who was hired in January, 2006: the state math and ELA tests which children take, and which are the primary basis for assigning these lovely letter grades, are not vertically equated. (See skoolboy's testing primer here.) This means that there is no basis for comparing performance on the fourth-grade test with performance on the fifth-grade test. For each test, there is a subjective judgment about what level of performance constitutes proficiency, but the tests are independent. There is no basis for claiming that children are going backward; there’s no justification for claiming that a child “lost ground in proficiency,” since proficiency doesn’t exist in the abstract, but rather in grade-specific skills; and the children are not being compared to themselves, but rather their location in the distribution of children’s performance in one year is being compared to their location in the distribution of children’s performance the following year.

Perhaps Jim Liebman simply misspoke, as perhaps did Chancellor Joel Klein when he referred to statistical significance as “playing something of a game.” Such missteps might arise from the tremendous pressure to justify a particular high-stakes evaluation of a school when there are multiple sources of information about school performance that point in different directions—NCLB status, achievement levels, gains, school quality reviews, not to mention the public pronouncements of Liebman’s boss, and his boss’s boss.

There’s nothing wrong, in skoolboy’s view, in looking at students’ achievement growth as one of several criteria for judging how well a school is doing in relation to other schools. But I would never think of using year-to-year changes in proficiency levels on just two tests as the primary basis for evaluating a school’s performance. And neither would most people who study testing and assessment for a living.

September 12, 2008

Cool People You Should Know: Doug Downey

To many observers of public education, there is no doubt about which schools are failing - it's the schools with low rates of students passing state tests, stupid!

Of course, this assumes that students' achievement is a direct measure of school quality. "Yet we know that this assumption is wrong....It follows that a valid system of school evaluation must separate school effects from nonschool effects on children's achievement and learning" writes Doug Downey, a cool Ohio State sociologist of education you should know, in his recent paper (in collaboration with Paul von Hippel and Melanie Hughes), "Are 'Failing' Schools Really Failing?"

Analyzing data from the Early Childhood Longitudinal Study - Kindergarten Cohort, a national sample of 21,000 kindergarteners who were then followed through 5th grade, Downey and colleagues thus set out to isolate the effects of schools on student learning. The ECLS data are uniquely suited for this task because the study evaluated students in the fall and spring of kindergarten, and again in the fall and spring of first grade. It turns out that summers - a time when students are only affected by non-school influences - are the key to teasing apart school and nonschool factors.

Downey and colleagues look at schools' effectiveness in four different ways. First, they examine NCLB's method - overall test score levels. They then turn to 12-month learning rates; think growth models, which measure test score growth, for example, between a test given in April 2007 and a test given in April 2008. They contrast those rates with 9-month learning rates; imagine a test given in September, and then again in May. Finally, they introduce a measure called impact, which is the difference between the school year and summer learning rate.
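The four measures can be sketched in a few lines. The version below assumes each school is summarized by three average scores (fall, spring, and the following fall), a 9-month school year, and a 3-month summer; the function names and the two example schools are hypothetical, and the 12-month window here runs fall to fall rather than April to April. The published analysis is more careful about exact test dates than this sketch.

```python
def learning_rate(start_score, end_score, months):
    """Average points gained per month between two test dates."""
    return (end_score - start_score) / months

def school_measures(fall, spring, next_fall, school_months=9, summer_months=3):
    """Achievement level, 12-month and 9-month learning rates, and impact."""
    school_year = learning_rate(fall, spring, school_months)
    summer = learning_rate(spring, next_fall, summer_months)
    twelve_month = learning_rate(fall, next_fall, school_months + summer_months)
    return {
        "achievement": spring,            # overall test score level
        "twelve_month": twelve_month,     # growth over a window that includes a summer
        "nine_month": school_year,        # growth during the school year only
        "impact": school_year - summer,   # school-year rate minus summer rate
    }

# Two made-up schools: one whose students keep gaining over the summer,
# one whose students lose ground when school is out.
affluent = school_measures(fall=50, spring=59, next_fall=62)
high_poverty = school_measures(fall=30, spring=42, next_fall=40.5)
for name, m in [("affluent", affluent), ("high-poverty", high_poverty)]:
    print(name, {key: round(value, 2) for key, value in m.items()})
```

In this invented example the affluent school looks better on achievement and on 12-month growth, while the high-poverty school looks better on 9-month growth and much better on impact, which is the kind of reordering the paper is concerned with.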

"Impact" is attractive because it doesn't require us to measure and statistically control for all of the different aspects of children's nonschool environments that may affect school success, as do cardiac surgery report cards. It captures what we need to know about students' out-of-school environments without bogging us down in the methodological and political problems associated with introducing these controls. And it helps us adjust for "soft" factors like innate student motivation, which are difficult to measure and control for. Moreover, it holds schools harmless for what happens to their students over the summer, which currently serves as a confounding factor in growth models.

What percent of schools performing in the bottom 20% of overall achievement are actually in the bottom 20% for measures of impact and learning? Less than half! High-achieving schools are concentrated in more affluent communities, but "high impact" schools exist across the socioeconomic spectrum. And the opposite is true. There are plenty of schools with good test scores that are skating by simply because they had advantaged kids to begin with.

What does this all mean for NCLB? Downey and colleagues put it like this:
Our results raise serious concerns about the current methods that are used to hold schools accountable for their students' achievement levels. Because achievement-based evaluation is biased against schools that serve the disadvantaged, evaluating schools on the basis of achievement may actually undermine the NCLB goal of reducing racial/ethnic and socioeconomic gaps in performance. If schools that serve the disadvantaged are evaluated on a biased scale, their teachers and administrators may respond like workers in other industries when they are evaluated unfairly - with frustration, reduced effort, and attrition. Under a fair system, a school's chances of receiving a high mark should not depend on the kinds of students the school happens to serve.
Crystal clear, creative thinking is the distinguishing feature of Downey's work - see, for example, his paper on school effects on child obesity, or his paper asking if schools are "the great equalizer."

Wonks can rest a little easier tonight with the knowledge that Downey's now turned his attention to NCLB.

September 9, 2008

Lessons for No Child Left Behind from "No Cardiac Surgery Patient Left Behind"

New AYP numbers are out, folks. In California, only 48% of schools made AYP, and only 34% of middle schools did so. In Missouri, only about 40% of schools made AYP. Pick almost any state, and you'll see that there are soaring numbers of schools designated as "in need of improvement." With numbers like these, it's worth considering whether NCLB's measurement apparatus is accurately identifying "failing schools."

One way to get leverage on this question is to consider how other fields approach the issue of accountability. Doctor and hospital accountability for cardiac surgery - also the topic of a NYT commentary today - is instructive in this regard. Borrowing heavily from previous work, let me outline how state governments have approached doctor and hospital accountability in medicine. In subsequent posts this week, I'll write about the outcomes of medical accountability systems, as well as some of their unintended consequences.

Medicine makes use of what is known as “risk adjustment” to evaluate hospitals’ performance. Since the early 1990s, states have rated hospitals performing cardiac surgery in annual report cards. The idea is essentially the same as using test scores to evaluate schools’ performance. But rather than reporting hospitals’ raw mortality rates, states “risk adjust” these numbers to take patient severity into account. The idea is that hospitals caring for sicker patients should not be penalized because their patients were sicker to begin with.

In practice, what risk adjustment means is that mortality is predicted as a function of dozens of patient characteristics. These include a laundry list of medical conditions out of the hospital’s control that could affect a patient’s outcomes: the patient’s other health conditions, demographic factors, lifestyle choices (such as smoking), and disease severity. This prediction equation yields an “expected mortality rate”: the mortality rate that would be expected given the mix of patients treated at the hospital.

While the statistical methods vary from state to state, the crux of risk adjustment is a comparison of expected and observed mortality rates. In hospitals where the observed mortality rate exceeds the expected rate, patients fared worse than they should have. These “adjusted mortality rates” are then used to make apples-to-apples comparisons of hospital performance.
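Here is a minimal sketch of the expected-versus-observed comparison. It assumes a logistic prediction equation with made-up coefficients and only three patient characteristics; real report cards use dozens of clinically validated factors and their own estimation methods, so every name and number below is hypothetical.

```python
import math

# Hypothetical coefficients for a logistic mortality model.
COEFS = {"intercept": -4.0, "age_over_75": 0.9, "diabetes": 0.5, "prior_heart_attack": 0.7}

def predicted_risk(patient):
    """Predicted probability of death given the patient's risk factors (0 or 1 each)."""
    z = COEFS["intercept"] + sum(
        COEFS[k] * patient.get(k, 0) for k in COEFS if k != "intercept"
    )
    return 1 / (1 + math.exp(-z))

def observed_to_expected(patients):
    """Observed deaths divided by the deaths expected for this mix of patients."""
    observed = sum(p["died"] for p in patients)
    expected = sum(predicted_risk(p) for p in patients)
    return observed / expected

def make_patients(n, deaths, **risk_factors):
    """Build n identical-risk patients, the first `deaths` of whom died."""
    return [dict(risk_factors, died=1 if i < deaths else 0) for i in range(n)]

# Hospital A treats low-risk patients; Hospital B treats very sick ones.
hospital_a = make_patients(100, deaths=2)
hospital_b = make_patients(100, deaths=10, age_over_75=1, diabetes=1, prior_heart_attack=1)

print(f"Hospital A: raw rate 2%,  O/E = {observed_to_expected(hospital_a):.2f}")
print(f"Hospital B: raw rate 10%, O/E = {observed_to_expected(hospital_b):.2f}")
```

With these invented numbers, Hospital B's raw mortality rate is five times Hospital A's, yet its ratio of observed to expected deaths is below 1 while Hospital A's is above 1. Judged against its own patient mix, B did better than predicted and A did slightly worse.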

Accountability systems in medicine go even further to reduce the chance that a good hospital is unfairly labeled. Hospitals vary widely in size, for example, and in small hospitals a few aberrant cases can significantly distort the mortality rate. So, in addition to the adjusted mortality rate, confidence intervals are reported to illustrate the uncertainty that stems from these differences in size. Only when these confidence intervals are taken into account are performance comparisons made between hospitals.
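To see why size matters, consider two hypothetical hospitals with the same observed mortality rate but very different caseloads. The sketch below uses the Wilson score interval as one standard way to put a confidence interval around a proportion; this is an assumption made for illustration, since the states' report cards apply their own interval methods to the risk-adjusted rate, and the 4% expected rate is invented.

```python
import math

def wilson_interval(deaths, n, z=1.96):
    """Approximate 95% Wilson score interval for a mortality rate of deaths/n."""
    p = deaths / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

EXPECTED_RATE = 0.04  # hypothetical expected mortality rate for both hospitals

for name, deaths, n in [("small hospital", 3, 50), ("large hospital", 60, 1000)]:
    low, high = wilson_interval(deaths, n)
    verdict = "cannot be distinguished from" if low <= EXPECTED_RATE <= high else "differs from"
    print(f"{name}: {deaths / n:.0%} observed, interval {low:.1%} to {high:.1%}, "
          f"{verdict} the expected rate")
```

Both hospitals have a 6% observed rate, but only the large one can be said with any confidence to be above the expected rate. A point estimate alone would treat the two identically.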

Contrast this approach with that used by the New York City Department of Education's progress reports, where "point estimates" are used to array schools on an A-F continuum with no regard for measurement error. Readers know well that your friendly neighborhood "statistical nut" has no beef with the use of sophisticated statistical methods to compare schools. But I would just ask that we have some humility about what these methods can and cannot do. (Sidenote: The only winners when we ignore these issues are educational researchers, who can then write regression discontinuity papers using these data. Thanks for the publications, Joel and Mike!)

And it's quite eye-opening to compare the language used by state and federal governments to explain their accountability systems with the rhetoric we hear in education. Consider this statement from the Department of Health and Human Services to explain the rationale behind risk adjustment:
The characteristics that Medicare patients bring with them when they arrive at a hospital with a heart attack or heart failure are not under the control of the hospital. However, some patient characteristics may make death more likely (increase the ‘risk’ of death), no matter where the patient is treated or how good the care is. … Therefore, when mortality rates are calculated for each hospital for a 12-month period, they are adjusted based on the unique mix of patients that hospital treated.
If you replace the word "hospital" with "school" above, you can imagine the reception this statement would receive in the educational accountability debate. Soft bigotry of low expectations, and you probably kill baby seals for fun, too.

Readers, why is the educational debate so different? Full disclosure: I will shamelessly appropriate your thoughts in my dissertation, which attempts to answer this question, and also establish the effects of each of these systems on race, gender, and socioeconomic inequalities in educational and health outcomes.

September 7, 2008

Predicting the Near Future*

Sometime soon, with great fanfare, the New York City Department of Education will release this year’s School Progress Reports. (Word on the street is that schools already know their grades.) The School Progress Reports, for better or worse, are the centerpiece of the NYC accountability system. (skoolboy thinks for worse, but more on that later.)

The DOE has made a number of changes to the Progress Reports for this second iteration, and I think that eduwonkette had something to do with that (as did other critics and analysts outside of the Tweed inner circle.) We can expect to see separate letter grades for the three major dimensions on which the Progress Reports are based: school environment (including attendance, and parent, teacher and student surveys), student performance, and student progress. But the overall format appears to be unchanged: most of the grade is based on student progress on test scores, and such gains are not very reliable from one year to the next. There is, in skoolboy’s opinion, a false sense of precision conveyed by these letter grades, as they are based on components that are measured with error, but that measurement error is not reflected in how the grades are calculated. And I’m particularly annoyed at the misuse of social surveys for accountability purposes.

Nevertheless, the DOE is marching onward, and we’ll have this year’s grades to pore over in the near future. (And you can bet that eduwonkette will put on the green eyeshade for this, even though it clashes with her cape and mask.) How many schools will improve their grade from last year to this year? How many will fall? It’s time to make some predictions. What do you think, readers?

Here's a five-by-five table designed to show how this year’s grades are associated with last year’s grade. Each column represents last year’s grade, and each row represents a possible outcome for this year. The column percentages will add up to 100%. Try to fill in the blanks: What percentage of the schools that received A’s last year will receive an A this year? What percentage of A’s will decline to B’s? What fraction will fall further to C’s, D’s, and F’s? At the other end of the spectrum, what percentage of last year’s F’s will remain F’s? What percentage will climb out of the cellar to obtain a D? Will any make the leap from F to A?

[Table image: blank five-by-five grid, last year's grade (columns) by this year's grade (rows)]

As a reminder, last year, about 23% of schools received an A; 38% received a B; 26% received a C; 8% received a D; and 4% (i.e., 53 schools) received an F.
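For readers who would rather tabulate than eyeball, here is a minimal sketch of how a table like this could be filled in once both years of grades are in hand. The ten school grade pairs are invented for illustration, and the function name is hypothetical; only the layout (columns for last year's grade, rows for this year's, each column summing to 100%) comes from the description above.

```python
from collections import Counter

GRADES = ["A", "B", "C", "D", "F"]

def column_percentages(pairs):
    """Return {last_year: {this_year: percent}} so each non-empty column sums to 100."""
    counts = Counter(pairs)  # (last_year, this_year) -> number of schools
    table = {}
    for last in GRADES:
        column_total = sum(counts[(last, this)] for this in GRADES)
        table[last] = {
            this: (100.0 * counts[(last, this)] / column_total if column_total else 0.0)
            for this in GRADES
        }
    return table

# Hypothetical (last year, this year) grade pairs for ten made-up schools.
example_pairs = [
    ("A", "A"), ("A", "A"), ("A", "B"), ("A", "C"),
    ("B", "A"), ("B", "B"), ("B", "B"), ("B", "C"),
    ("F", "D"), ("F", "F"),
]

table = column_percentages(example_pairs)

# Print rows = this year's grade, columns = last year's grade.
print("this\\last " + "".join(f"{g:>7}" for g in GRADES))
for this in GRADES:
    print(f"{this:<10}" + "".join(f"{table[last][this]:7.1f}" for last in GRADES))
```

Columns with no schools (C and D in this toy data) are left at zero rather than forced to sum to 100.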

A caveat: The DOE knows that the legitimacy of the School Progress Reports depends on the grades not being too volatile from year to year. If 75% of last year’s A’s became F’s this year, no one would take this scheme seriously. (And if schools that everyone views as exemplary or high-performing got middling grades, this too would call the scheme’s legitimacy into question. So don't expect Stuyvesant High School to get a C.) There may not be very much fluctuation from last year to this. You can be sure that the DOE has constructed this year’s scores so that there’s not too much instability from last year to this year.

But since we believe in incentives on this blog, the reader who comes closest to the actual association between last year and this year shall receive a prize to be selected by eduwonkette—and we know how creative she can be. Be sure to fill in all 25 blanks.

*Employees of Tweed Courthouse, KPMG Consulting, and the Parthenon Group are ineligible for this contest.

August 27, 2008

Guest Blogger Bruce Fuller: The Benefits and Dilemmas of Centralized Accountability

Bruce Fuller, sociologist and professor of education and public policy at the University of California - Berkeley, has co-edited a new book, Strong States, Weak Schools: The Benefits and Dilemmas of Centralized Accountability. Below, he provides a Q&A on the book’s findings.

Q. Media reports summed up your findings by saying that teacher responses to the No Child Left Behind Act and state accountability efforts have been “haphazard”, and teachers are feeling demoralized. Didn’t we know this already?

A. We do know that teacher associations are eager to revamp No Child following the November elections, and even recraft Washington’s role in education. And the Bush Administration, business groups, and some civil rights advocates claim that No Child is working.

The seven research teams that came together to produce Strong States, Weak Schools set the stage by first showing that student achievement has inched up at a glacial pace since No Child was enacted in 2002, even slowing relative to the progress observed in the 1990s, when state-led accountability and school finance reforms were successfully pursued. Progress is more discernible in certain states.

But few researchers have hung out in schools, interviewed teachers and principals, and asked how front-line educators interpret new accountability regimes. This includes how teachers try to address state curricular standards, how they might use more textured data on what students are learning (or not), and the extent to which principals (and their district superintendents) motivate their teachers to focus on improving their pedagogies.

Earlier ethnographic studies tended to be conducted by scholars with a priori agendas, hoping to detail how teachers feel overly controlled by accountability measures, or how teachers held deep affection for them. Instead, our seven contributing teams probed different parts of the implementation elephant. Do front-line educators in elementary versus secondary schools hold different viewpoints? Do exit exams prompt different responses inside our high schools? Do the rules and tools of accountability programs operate differently to boost average student achievement, in contrast to factors that narrow racial gaps inside schools?

Q. So, does teacher resistance to top-down accountability programs help to explain the tepid gains in student test scores?

A. All seven teams found that teachers and principals have redoubled their efforts to assist low-performing students, in part because of accountability programs advanced from either state capitals or Washington. The spotlight placed on how student subgroups are doing, the availability of richer data on individual student competencies, and the threat of sanctions are motivating teachers to buckle down and collaborate to devise new pedagogical approaches and build stronger relationships with students.

Yet two factors constrain whether teacher responses are coordinated and effective over time. First, the RAND study, led by Laura Hamilton, found that the attention that teachers pay to curricular standards, whether they study student data, and the value they place on accountability pressures vary enormously within schools. The good news is that teachers in poor communities are not more or less responsive to accountability rules and tools, compared to those in middle-class neighborhoods. The bad news is that teacher responses are highly variable and eclectic within schools. This suggests that relatively few principals motivate their staff to pull in the same direction and employ new training and data tools that accountability programs often support.

Second, the uneven leadership of district superintendents and the stickiness of school institutions – especially high schools – tend to disempower principals. Tom Luschei and Gayle Christensen probed deep into these dynamics, hanging out over time in a few districts. They found that district leaders often respond to accountability demands in ritualized fashion, failing to work intensively with their principals to mobilize rules and tools. Two studies of high school responses, appearing in Strong States, Weak Schools, detail how growth targets, program improvement triggers, and exit exams turn teacher attention to low-achieving adolescents. But these individual-level responses rarely lead to innovative structural change in balkanized high schools.

Q. What is working to motivate teachers and raise student achievement, then?

A. Two studies in the book offer insights here: Melissa Henne and Heeju Jang examined what worked in 111 California elementary schools as they variably succeeded in closing achievement gaps between Anglo and Latino students. They show that disparities narrow when teachers report that their principal motivates staff to focus on raising achievement and delivers tools that make everyone feel efficacious. This is not simply a mechanical process: more equitable schools have teachers who report strong, respectful relationships with their principal and colleagues.

And Soung Bae went deeper into a California school district that had narrowed ethnic achievement gaps over time. She discovered district leaders who banked heavily on inservice teacher training – hammering on state curricular standards and inventive pedagogies. Then, district staff followed teachers back into their classrooms to provide ample clinical follow-up.

Q. So, what do these implementation studies say to state and federal policy makers who will soon be debating changes in accountability programs?

A. Pay attention to what motivates teachers, who, like other professionals, seem eager to pursue shared goals if they are trusted to improve their craft. The link between district staff and principals appears to be key. If district leaders are simply messengers of government – with little agility in adapting to rules and mobilizing tools – then their principals will have less capacity to motivate their teachers.

Teachers do report enormous dissatisfaction, at least in California, Georgia, and Pennsylvania, in being forced to ignore certain subjects and topics if they do not appear on state tests. Somehow, policy makers must face the sharp-edged dilemma of simplifying tests and the curriculum, while recognizing that tying the hands of teachers may erode everyone’s motivation.

All seven empirical studies can be viewed here.

August 21, 2008

Cool People You Should Know: David Figlio

Economist David Figlio, who has extensively studied the intended and unintended consequences of accountability systems, recently made a move from the University of Florida over to Northwestern. Figlio has a knack for the creative - but still substantive - paper: for example, see his papers on the unintended consequences of accountability systems, including "Food for Thought? The Effects of School Accountability Plans on School Nutrition," "Accountability, Ability, and Disability: Gaming the System?," and "Testing, Crime, and Punishment." More recently, he mounted an impressive survey of Florida principals to identify their responses to accountability pressures. (See "Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Programs.")

In our chat on testing and accountability on Tuesday, Figlio provided a terrific overview of the accountability literature in response to Sherman Dorn's question, which is worth reprinting in full here:
I think that the evidence is becoming clearer that many of the hopes of high-stakes accountability advocates and many of the fears of high-stakes accountability critics are correct -- school administrators and teachers can and do respond to accountability pressures, at least at the margins.

A number of recent studies have shown that schools subject to greater accountability pressure tend to improve student test performance in reading and mathematics to a meaningful degree -- my recent study of Florida with Cecilia Rouse, Jane Hannaway and Dan Goldhaber (working paper on the website of the National Center for the Analysis of Longitudinal Data in Education Research, or caldercenter.org), for instance, suggests test score gains of one-tenth of a standard deviation in reading and math associated with a school getting an "F" grade relative to a "D" grade. We find that these test score gains persist for several years after the student leaves the affected school. Jonah Rockoff of Columbia University has a new working paper studying New York City's rollout of school grades that suggests that responses to grading pressure seem to happen immediately -- grades released in November were manifested in test score changes in the same winter/spring.

In the case of my study with Rouse, Hannaway and Goldhaber, we try to look inside the "black box" by studying a wide variety of potentially productive school responses, and it appears that Florida schools responded to accountability pressures by changing some of their instructional policies and practices, rather than "gaming the system."

The rapid and apparently productive response of school personnel to school accountability pressure suggests that educators are, at least to some degree "magisters economici," responding to the incentives associated with the system. And this makes getting the system right so important, because if schools and teachers respond quickly to incentives, the incentives had better be what society/policymakers want.

Many people raise concerns about teaching to the test, and there is certainly evidence of this -- consistently, estimated effects of accountability on high-stakes tests are larger than those on low-stakes tests -- though the low-stakes test results tend to be meaningful still, especially with respect to math. Harder to get a handle on is the narrowing of the curriculum to concentrate on the measured subjects; there is a lot of suggestive evidence that this is taking place to a small degree at the elementary level, though studies of the effects of accountability on performance on low-stakes subjects typically don't find that performance on these subjects suffers -- but of course, those subjects are still being measured with tests. Still there is certainly the incentive to reduce focus on "low-stakes" subjects. One possible solution for those concerned about low-stakes subjects being given short shrift would be to impose requirements such as minimum time spent on instruction or portfolio reviews.

There is a lot of evidence that accountability systems can have unintended consequences that are predicted by the magister economicus model. Derek Neal and Diane Whitmore Schanzenbach at the University of Chicago note that accountability systems based on getting students above a given performance threshold tend to induce schools to focus on the kids on the "bubble." I've found that that type of system may lead schools to employ selective discipline in an apparent attempt to shape the testing pool, or even to utilize the school meals program to artificially boost student test performance by "carbo-loading" students for peak short-term brain activity. These types of unintended consequences are much more likely in accountability systems based on the "status" model of getting students above a proficiency threshold, rather than the "gains" model of evaluating schools based on how much these students gain.

But there's a tradeoff here. The more we evaluate schools based on test score gains, where gaming incentives are lower, the more the focus is taken off of poorly-performing students whom society/policymakers would like to see attain proficiency. How the system is designed is crucially important.
You can find the transcript for the chat on testing and accountability here.

August 15, 2008

Join a Chat about Testing and Accountability in the NCLB Era: Tuesday, August 19th, 3-4pm

On Tuesday, David Figlio - an economist who does great work on the intended and unintended consequences of accountability systems - and I will chat with Ed Week readers about testing and accountability. The event description is below, and you can submit questions here:
Raising student achievement has long been a major issue in the American public education system. But with the advent of the No Child Left Behind Act and its testing mandates, even more attention has been directed towards this issue. As states release their annual school report cards, testing and accountability have once again emerged as hot topics of debate, with New York City Public Schools receiving considerable scrutiny of late.

Consequently, many observers have questioned whether state testing and accountability systems are accurately depicting student performance and the size of the achievement gap between groups.

July 16, 2008

The Vision Vacuum

"You're too young to be this cynical," he said, staring across his desk at me with a perplexed half smile.

I was 10, and in the middle of our classroom's simulated presidential campaign in which we followed the election and voted for candidates, my 5th grade teacher had launched into a pep talk about the potential for real change.

The last eight years have done little to temper my built-in skepticism. These are dark times, Diane Ravitch reminds us this morning. If I saw the glass as half-empty when Bush assumed the presidency, I now see it as half full - with poison.

That spin has taken over education policymaking hasn't helped. Accountability, as we used to talk about it back in the 1990s, was a way to evaluate reforms and provide incentives to implement them. It was never intended to be the reform. Now everyone's being tested and rated and graded and held accountable, but no one is supporting schools to improve the day-to-day work of teaching and learning. Policymakers say they want to leave "no child behind," but are willing to deny them health care in their next breath. We've adopted every technocratic solution that newly minted MBAs can come up with, but we have no educational vision.

So it was with cautious optimism that I followed Randi Weingarten's acceptance of the AFT presidency on Monday. As Dan Brown articulates in this post, she's a fighter, and one at the forefront of critiquing our current reform movement's easy slogans: "Too often, testing has replaced instruction; data has replaced professional judgment; compliance has replaced excellence; and so-called leadership has replaced teacher professionalism."

In her acceptance speech, which was bold and unapologetic, she embraced the proposed reforms of the Bolder and Broader coalition, and let us imagine what an alternate educational vision for public schools could look like. Watch the whole speech and let me know what you think - or just take a look at the clip below.

July 2, 2008

Educational Testing: A Brief Glossary

While you’re waiting for Dan Koretz’ book on testing to arrive – I think eduwonkette and I should get some kind of consideration for shilling for this book so often here – here’s a brief skoolboy’s-eye view on testing. Actual psychometricians are welcome to correct what I have to say.

Tests are typically designed to compare the performance of students (whether as individuals, or as members of a group) either to an external standard for performance or to one another. Tests that compare students to an external standard are called criterion-referenced tests; those that compare students to one another are called norm-referenced tests. Even though criterion-referenced tests are intended to hold students’ performance up to an external standard, there is often a strong temptation to compare the performance of individual students and groups of students on such tests, as if they were norm-referenced.

A typical standardized test of academic performance will have a series of items to which students respond, generally in either a multiple-choice format or a constructed-response format, in which students construct their own response to the item. There’s usually only one right answer to a multiple-choice item, whereas constructed-response items may be scored so that students get partial credit if they demonstrate partial mastery of the skill or competency that the item is intended to represent. For any test-taker, we can add up the number of right answers, plus the scores on the constructed-response items, to derive the student’s raw score on the test. A test with 45 multiple-choice items would have raw scores ranging from 0 to 45.

For individual test items, we can look at the proportion of test-takers who answered the item correctly, which is referred to as the item difficulty or p-value. This has nothing to do with the p-values used in tests of statistical significance; the p simply stands for the proportion of examinees who got the item right. Some test items are more difficult than others, and hence items will have varying p-values.
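To make raw scores and item p-values concrete, here is a small sketch using a made-up response matrix for five students and four multiple-choice items. The data are hypothetical and there is no partial credit, so each entry is simply 1 (right) or 0 (wrong).

```python
# Hypothetical scored responses: one row per student, one column per item.
# 1 = correct, 0 = incorrect (multiple-choice only, so no partial credit).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]

# Raw score: number of items each student answered correctly.
raw_scores = [sum(row) for row in responses]

# Item difficulty (p-value): share of students who got each item right.
n_students = len(responses)
p_values = [sum(column) / n_students for column in zip(*responses)]

print("raw scores:", raw_scores)
print("item p-values:", p_values)
```

Note the slightly backwards convention: a lower p-value means a harder item, since fewer examinees answered it correctly.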

Raw scores are rarely interpretable, in part because they are a function of the difficulty of the items. For this reason, they are typically transformed into scale scores, which are designed to generate a score that will mean the same thing from one version of a test to the next, or from one year to the next. The scale for scale scores is arbitrary; the SAT is reported on a scale ranging from 200 to 800, whereas the NAEP scale ranges from 0 to 500.

The process of transforming raw scores into scale scores is computationally intensive, generally using a technique known as Item Response Theory (IRT), which simultaneously estimates the difficulty of an item, how well the item discriminates between high and lower performers, and the performance of the examinee. An examinee who successfully answers highly difficult items that discriminate between high and low performers will be judged to have more ability, and hence a higher scale score, than an examinee who gets the difficult items wrong.
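As a rough illustration, one widely used IRT model (the two-parameter logistic, or 2PL) expresses the chance of a correct answer as a function of the examinee's ability, the item's difficulty, and the item's discrimination. The sketch below only evaluates that function for made-up item parameters; it does not do the hard part, which is estimating all of these quantities at once from real response data. Which model a given testing program actually uses is not something this post specifies, so treat the choice of 2PL as an assumption.

```python
import math

def prob_correct(ability, difficulty, discrimination):
    """2PL item response function: chance that an examinee answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical items on a standardized ability scale (0 = average examinee).
easy_item = {"difficulty": -1.0, "discrimination": 1.0}
hard_item = {"difficulty": 1.5, "discrimination": 2.0}

for ability in (-1.0, 0.0, 2.0):
    p_easy = prob_correct(ability, **easy_item)
    p_hard = prob_correct(ability, **hard_item)
    print(f"ability {ability:+.1f}: easy item {p_easy:.2f}, hard item {p_hard:.2f}")
```

An examinee whose ability equals an item's difficulty has a 50% chance of answering it correctly, and a more discriminating item separates examinees just above and just below that point more sharply.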

There’s no one right way to transform raw scores into scale scores, and it’s always a process of estimation, which is sometimes obscured by the fact that scores are reported as definite quantities. (A little skoolboy editorializing here…) The expansion of testing hastened by NCLB has placed a lot of pressure on states, and their testing contractors, to construct scale scores for a test that represent the same level of performance from one year to the next (a process known as test equating). Much of this is done under great time pressure, and shielded from public view. The process is complicated by the fact that states typically don’t want to release the actual test items they use, because then they can’t use them in subsequent assessments as anchor items that are common across different forms of a test, since students’ performance on such items could change due to practice. Some tests are vertically equated, which means that a given score on the fourth-grade version of a test represents the same level of performance as that same score on the fifth-grade version of the test. In a vertically-equated test, if the average scale score is the same for fourth-graders as it is for fifth-graders, we’d infer that the fifth-graders haven’t learned anything during fifth-grade.

Proficiency scores represent expert judgments about what level of scale score performance should describe a student as proficient or not proficient at the underlying skill or competency that the test is measuring. For example, NAEP defines three levels of proficiency for each subject at each of the grades tested (4th, 8th and 12th): basic, proficient, and advanced. Cut scores divide the scale scores into categories that represent these proficiency levels, with students classified as below basic, basic, proficient, or advanced. These proficiency scores do not distinguish variations in students’ performance within the category; one student could be really, really advanced and another just advanced, and whereas a scale score would record that difference, a proficiency score would simply classify both students as advanced. The fact that proficiency levels are determined by expert judgment, and not by the properties of the test itself, means that they are arbitrary; the level of performance designated as proficient on NAEP may not correspond to the level of performance designated as proficient on an NCLB-mandated state test. Many researchers (including Dan Koretz, eduwonkette, and me) are concerned that the focus on proficiency demanded by NCLB accountability policies has the unintended consequence of concentrating the attention of school leaders and practitioners on a narrow range of the test-score distribution, right around the cut score for the category of “proficient,” to the detriment of students who are either well below or well above that threshold. Such a focus is a political judgment, not a psychometric one, and there are arguments both for and against it.
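Here is a minimal sketch of how cut scores carve a scale into proficiency categories. The cut scores and the scale itself are invented for illustration; real cut scores come out of expert standard-setting panels and differ across tests, subjects, and grades.

```python
from bisect import bisect_right

# Hypothetical cut scores on a made-up scale.
CUT_SCORES = [200, 240, 280]  # minimum scores for basic, proficient, advanced
LEVELS = ["below basic", "basic", "proficient", "advanced"]

def proficiency_level(scale_score):
    """Map a scale score to the highest category whose cut score it meets."""
    return LEVELS[bisect_right(CUT_SCORES, scale_score)]

for score in (195, 239, 240, 281, 340):
    print(score, proficiency_level(score))
```

Two things stand out in the output: scores of 281 and 340 land in the same "advanced" bin despite a 59-point difference, while scores of 239 and 240 land in different bins despite a one-point difference. That second fact is exactly why attention gravitates toward students sitting near a cut score.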

I'll update this as more knowledgeable readers weigh in. If experts in measurement were to judge proficiency thresholds for knowledge about testing, I'd probably be classified as basic; Dan Koretz is definitely advanced. For a lively and readable treatment of these kinds of issues, get his book!

July 1, 2008

An Immodest Proposal

This year’s statewide fourth-grade math exam administered in New York State -- the one with the remarkably high gains -- contained the following item:

“Janice bought a notebook for $3.75 and a pencil for $0.47. She gave the cashier $5.00. How much money did Janice receive in change?”

The item might have looked a little familiar to fourth-grade teachers. In 2007, a similar item appeared:

“Tony bought art supplies that cost $19.31. He gave $20.00 to the cashier. How much money did Tony receive in change?”

And in 2006, an item read:

“Mr. Marvin spent $54.10 on pants and shirts. He gave the cashier $60.00. How much money should Mr. Marvin receive in change?”

Other similarities abound. In 2008, an item read:

During the year, one thousand eight hundred four books were checked out of the school library. What is another way to write this number?

A. 184
B. 1,084
C. 1,804
D. 1,840

There was an uncanny resemblance to an item on the 2007 test:

The number of people who live in Goodwin Falls is three thousand nine hundred eight. What is another way to write the same number?

A. 398
B. 3,098
C. 3,908
D. 3,980

To be sure, the test-takers in 2008 still had to answer these questions correctly to get credit for them. But the similarity in item formats across the years gives some credence to concerns that scores are inflated.

Dan Koretz discusses the problem of score inflation in his excellent new book, Measuring Up: What Educational Testing Really Tells Us. One source of the problem, he explains, is that all tests sample the subject-matter domains that they are supposed to tap. If the same kind of item shows up repeatedly on the test from one year to the next, teachers and administrators can focus on this restricted set of test item types, and neglect other item types that are still part of the domain that the test is intended to represent.

The National Assessment of Educational Progress (NAEP) is sometimes referred to as the “gold standard” for standardized tests, and claims about test score inflation in a test, such as an NCLB-mandated state test, are often grounded in a discrepancy between NAEP and the other test either in the level of or trend in performance. The characterization of NAEP as the “gold standard” reflects the fact that it is designed to measure a much larger sample of student performance in a domain than is the typical state test. No individual child takes all of the items in the NAEP item pool; instead, students complete test booklets with blocks of items. In the 2000 12th-grade mathematics NAEP, for example, students completed one of 26 different test booklets, each containing three 15-minute blocks out of a total of 13 different blocks of mathematics items. Each student was asked to complete about 40 items across the domains of number sense, properties, and operations; measurement; geometry and spatial sense; data analysis, statistics and probability; and algebra and functions.

Overall, enough students respond to all of the items in the NAEP item pool to be able to measure how well the population of students in a state (or large urban district) is doing. But NAEP is not designed to yield scores for individual students, because no student responds to enough items to yield a reasonably precise measure of performance.
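To see how 13 blocks can be spread across 26 three-block booklets so that every block gets equal coverage, here is a sketch of one textbook arrangement (a balanced incomplete block design). This is an illustration of the general idea only: the two "starter" booklets are a hypothetical choice, and nothing here is a claim about how the actual NAEP booklets were laid out.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 13

# Two hypothetical "starter" booklets; shifting each one around the 13 blocks
# yields 2 x 13 = 26 booklets of three blocks apiece.
STARTERS = [(0, 1, 4), (0, 2, 7)]

booklets = [
    tuple(sorted((block + shift) % N_BLOCKS for block in starter))
    for starter in STARTERS
    for shift in range(N_BLOCKS)
]

# How often does each block appear, and how often do two blocks share a booklet?
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(
    pair for booklet in booklets for pair in combinations(booklet, 2)
)

print(len(booklets), "booklets")
print("appearances per block:", set(block_counts.values()))
print("times each pair of blocks shares a booklet:", set(pair_counts.values()))
```

In this arrangement each student sees only 3 of the 13 blocks, yet every block is answered by the same share of students and every pair of blocks is answered together by somebody. That is enough to describe the population, and far too little to describe any one child.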

With tongue firmly in cheek, skoolboy offers the following solution to test score inflation: more testing. Imagine if students completed the entire pool of NAEP items (or some other broad pool of items assessing performance in a domain), instead of the relatively restricted sample of items used in most state-level testing programs. If students were assessed on a broad array of items tapping subject matter competence, teachers and administrators would not be able to concentrate their attentions on a subset of item types, and hence would not be able to artificially raise students’ scores relative to their true learning of the subject. Sure, the burden of testing would increase; we'd need to invest in better and more expensive tests; and increased testing wouldn't solve the incentive problems that high stakes create.

More testing. An idea whose time has come?

Nah.

June 26, 2008

When Measuring Achievement Gaps, Beware the Proficiency Trap

Though we can thank the No Child Left Behind Act for drawing our attention to the "achievement gap" - which is now loosely deployed to reference gaps between African-American and white/Asian, poor and advantaged, suburban and urban, or even male and female kids - it's also done us a great disservice by distorting the way that we measure, and think about, differences between groups.

There are at least two ways of thinking about the relationship between achievement and kids' life chances. The first is to consider, in absolute terms, the set of skills that students have. The second views achievement as relative. Most coveted opportunities - jobs, college admission, a good grade in a college course, or positive evaluations in the workplace - are not divvied up based on students crossing an arbitrary line of proficiency or competence. We don't give everyone a job who's passed a basic reading test, nor do we admit everyone to UC-Berkeley who's received more than a 700 on the verbal SAT. Every student in a college course at NYU can't get an A, and faculty measure students' performance against others to assign grades. In short, all of these decisions are made by comparing the performance of those in a pool, and choosing those who come out near the top.

The proficiency view, to my mind, is certainly important to consider when we are thinking about building stocks of human capital. But if we are concerned about inequality and social stratification - ensuring that, on average, every demographic and socioeconomic group is equally prepared to compete in higher education and the workplace - relative achievement measured on a continuous scale is what matters, not proficiency rates.

Which brings us to how we currently measure "achievement gaps" between social groups, and why this method is tragically flawed. For example, if you look at the NYC press release from this year's test scores, you'll see that gaps are defined as the difference between the percentage of students that are proficient in each group. If the gap in proficiency between black and white students was 29 percentage points last year in 4th grade ELA, and now is 26 percentage points, we hear that the gap has narrowed by 3 percentage points. But it's possible that the gap in the achievement that matters - the continuous measure of achievement - has actually grown.

Let me give a brief example to illustrate. If we use the proficiency logic, the achievement gap that separates the Bronx and the affluent suburbs of Westchester is closing. And indeed on Monday, Mayor Bloomberg crowed that NYC is catching up to the suburbs. If we take a look at 7th grade math, we see that there was a 30 percentage point Westchester/Bronx gap in proficiency in 2007 (73% versus 43%), but this year, there is only a 25 percentage point gap (83% versus 58%). If we use a proficiency measure, the achievement gap has closed by 5 percentage points.

Not so fast. The achievement gap, if we measure the differences in the average student scores in Westchester and the Bronx, has actually increased in 7th grade math. The scale score gap was 28 points last year. Put differently, the average Bronx 7th grader scored at the 23rd percentile of the Westchester distribution in 2007. This year, the gap was 30 points. Now the average Bronx 7th grader has dropped to the 21st percentile of the Westchester distribution, even though the achievement gap, as measured by proficiency, is closing.
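If this seems paradoxical, a toy model may help. The sketch below assumes two groups whose scores are normally distributed with the same spread, and a fixed cut score; every number in it is invented and none of it is the actual Bronx or Westchester data. It shows that when both groups move up past the cut score, the proficiency gap can shrink even as the gap in average scores widens.

```python
from statistics import NormalDist

CUT = 650  # hypothetical proficiency cut score
SD = 35    # hypothetical spread, assumed equal for both groups and both years

def summarize(mean_high, mean_low):
    """Return (scale-score gap, proficiency gap in percentage points,
    percentile of the lower group's average student in the higher group)."""
    high, low = NormalDist(mean_high, SD), NormalDist(mean_low, SD)
    share_prof_high = 1 - high.cdf(CUT)
    share_prof_low = 1 - low.cdf(CUT)
    return (
        mean_high - mean_low,
        100 * (share_prof_high - share_prof_low),
        100 * high.cdf(mean_low),
    )

# Made-up group means: both groups improve, but the higher group gains more.
year1 = summarize(mean_high=670, mean_low=642)
year2 = summarize(mean_high=685, mean_low=655)

for label, (score_gap, prof_gap, pctile) in (("year 1", year1), ("year 2", year2)):
    print(f"{label}: scale gap {score_gap}, proficiency gap {prof_gap:.1f} pts, "
          f"lower group's mean at percentile {pctile:.1f}")
```

The proficiency gap depends on where the cut score happens to sit relative to each group's distribution, so it can move in the opposite direction from the gap in averages.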

Take-home point: when you hear about achievement gaps closing based on proficiency scores, beware of what you're being sold.

June 24, 2008

Are New York City Schools Shortchanging High Achieving Students? The View from 2003-2008

Savvy New York City parents have long suspected that high achieving kids are losing out in the push to boost the achievement of the lowest performing students. But those suspicions are often cast aside by public officials as helicopter parent whining or muted class warfare.

But a review of 4th grade test score data from 2003-2008 suggests that these parents have been on to something. Between 2003 and 2008, the fraction of students scoring in the highest achievement level on the 4th grade NY state ELA test has plummeted.

In 2003, 15.6% of 4th graders scored at Level 4. By 2008, only 5.8% did. In other words, the fraction of students scoring at Level 4 in 2003 was about 2.7 times as large as this year's. At the same time, the percentage of students scoring at proficiency has increased 9 percentage points, from 52.4% to 61.3%.

Put bluntly, it appears that schools are focusing on pushing lower performing students over the passing mark, and shortchanging high-achieving students in the process. In Bloomberg's New York, as it turns out, a rising tide does not lift all boats.

2003_8%204th%20Grade%20ELA.jpg

You can find the data from 1999-2005 here, and the data from 2006-2008 here. I analyzed 4th grade scores because tests weren't given in grades 3-5 throughout the entire time period. If anyone knows where to find average scale scores at different parts of the distribution over time (i.e. 10th/90th percentile) - I would have preferred to work with these data for all of the reasons suggested below - please let me know.

In NYC Middle Grades, Fewer High Achieving ELA Students, Even As Passing Rates Increase

In grades 5-7, grades that have seen sharp increases in ELA passing rates over the past two years, the percentage of New York City students scoring in the highest performance category has decreased substantially. You can find those results here. Interestingly, this is only true for ELA, not math.

* In 2006, 8.7% of 5th graders scored at Level 4 on the ELA. This year, only 4.3% did.

* In 2006, 7.1% of 6th graders scored at Level 4. This year, only 2.2% did.

* In 2006, 4.7% of 7th graders scored at Level 4. This year, only 1.6% did.

2006_8%20Level%204%20ELA.jpg

Anyone have ideas about what's going on here? Fordham's report on high achieving students in a NCLB era provides some insight, I think.

Scale Score Magic! Why We Shouldn't Rely on Passing Rates to Measure Academic Achievement

Consider this puzzle: in 2007, the average scale score on the New York State ELA Test was 661. In 2008, it is also 661. Yet the overall level of proficiency has increased by 3 percentage points, from 68% to 71%. How is this possible?

When we measure student achievement solely based on the proportion of students who have jumped over a bar, we can end up with a pretty misleading picture of student performance.

Take a look at grades 3, 5, and 8 in the graph below, which shows the change in ELA average scale scores and passing rates for New York state. In each case, the average scale score increased by 2 points, or about 0.05 standard deviations. But the increases in the percentage of students who were proficient varied widely across those grades. In 3rd grade, there was an increase of 3 percentage points. In 5th grade, there was a much larger increase - 9.5 percentage points. And in 8th grade, though the average scale score increased, the percentage of students who were proficient actually decreased by 0.9 percentage points.

Should we conclude that our 5th graders are much better off than they've been in the past, and 8th graders are falling behind? Definitely not - 5th grade just happened to hit the sweet spot of the distribution - but that's what you'd get if you relied only on passing rates.

2008%20ELA%20Graph%282%29.jpg

In short, know what you're buying when you're looking at passing rates. They can increase substantially by moving a small number of kids up a few points - just enough to clear the cut score. In some of the grade levels above, there are good reasons to suspect that these small moves may partially explain large jumps in proficiency on the New York State ELA test.
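Here is a toy version of the puzzle, as a minimal sketch. The cut score of 650 and both score lists are made up for illustration; they are not the state's data. Five students move from just below the bar to just above it, three high scorers slip by five points each, and nothing else changes.

```python
from statistics import mean

CUT_SCORE = 650  # hypothetical proficiency cut score


def passing_rate(scores):
    """Percentage of students at or above the cut score."""
    return 100 * sum(s >= CUT_SCORE for s in scores) / len(scores)


# Hypothetical cohort of 100 students in year 1.
year_1 = [600] * 20 + [648] * 10 + [660] * 50 + [700] * 20

# Year 2: five "bubble" students gain 3 points each (648 -> 651),
# and three high scorers lose 5 points each (700 -> 695).
year_2 = [600] * 20 + [648] * 5 + [651] * 5 + [660] * 50 + [695] * 3 + [700] * 17

for label, scores in (("year 1", year_1), ("year 2", year_2)):
    print(f"{label}: average scale score {mean(scores):.1f}, "
          f"passing rate {passing_rate(scores):.0f}%")
```

The average scale score is identical in both years, yet the passing rate rises by five percentage points, and the top of the distribution is slightly worse off.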

June 22, 2008

Our Very Own Disney Movie! The New York State 2008 ELA and Math Results

I really appreciate the opportunity to join all of you here at Disney World. I can't wait to get over to the Magic Kingdom. I just love cartoon characters; outlandish fairy tales; and wild, stomach-churning roller coaster rides.
-Mayor Bloomberg, Excellence in Action Summit

If you like fairy tales, today is your day. Overnight, the majority of kids in New York City have become proficient readers (up 7 percentage points to 58%) and mathematicians (up 9 percentage points to 74%). Apparently, scores are up even more in Buffalo, Yonkers, and Rochester. Here's Elizabeth Green's article in the NY Sun, Mayor Sees a Test Score Triumph: Or Is It a Case of Inflation of Results? When test scores rise dramatically on one test and are largely flat on the NAEP, we have good reasons to worry that something besides real learning is happening. In this case, it appears that the NY ELA and Math tests were just easier, which drove up scores across the state.

Alas, at the Magic Kingdom, outlandish fairy tales always win the day. Bloomberg is holding a press soiree at P.S. 175 in Harlem this afternoon, and the state is holding its press conference at 11:45. More details to follow...

June 13, 2008

Still a Bobo in Paradise

Meet the Status Quo. It includes the Chairman of the Board of the NAACP (Julian Bond), the former president of the Urban League (Hugh Price), a Nobel Prize-winning economist and expert on early childhood interventions (Jim Heckman), some of the country's most distinguished experts on urban poverty (William Julius Wilson, Christopher Jencks) and educational accountability (Helen Ladd), a well-known professor of pediatrics at Harvard Medical School (T. Berry Brazelton), two former Surgeons General (Jocelyn Elders and Richard Carmona), Ernie Cortes (of the Industrial Areas Foundation), school practitioners like Debbie Meier, Ted Sizer, and Jim Comer who have spent their careers challenging the status quo, and too many other people to list here who have dedicated their lives to improving the lives of poor and minority children. And yes, David, they accept your apology.

I really do hate my permanent residence in the reality-based community, but at least half of the achievement gap that exists between black and white students - the fact that the average black 12th grader performs at about the 16th percentile of the white distribution (a gap of about 1 standard deviation) - cannot possibly be attributed to the K-12 schools. Why? The average black student enters kindergarten testing at about the 25th percentile of the white distribution in math (a gap of 0.663 standard deviations), and the 35th percentile of the white distribution in reading (a gap of 0.4 standard deviations). "Squeezing teachers," "dealing with teachers who don't teach," or "holding teachers' feet to the fire," I'm sorry to say, are not going to address that gap. And between kindergarten and 12th grade, kids are only in school 22% of their waking hours. It turns out that poor students' slower rate of learning in the summer plays a significant role in increasing existing gaps.
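If you want to check how a gap in standard deviations translates into a percentile, here is a minimal sketch. It assumes the comparison distribution is normal, which is my simplifying assumption for illustration; the figures above presumably come from the actual score distributions, so small differences are expected.

```python
from math import erf, sqrt


def percentile_of_average(gap_in_sd: float) -> float:
    """Percentile of the reference distribution at which the average
    student of a group scoring `gap_in_sd` standard deviations lower
    would fall, assuming the reference distribution is normal."""
    return 100 * 0.5 * (1.0 + erf(-gap_in_sd / sqrt(2.0)))


# Gaps in standard deviation units, as quoted in the paragraph above.
gaps = {
    "12th grade": 1.0,
    "kindergarten math": 0.663,
    "kindergarten reading": 0.4,
}

percentiles = {name: percentile_of_average(g) for name, g in gaps.items()}
for name, pct in percentiles.items():
    print(f"{name}: gap of {gaps[name]} SD -> about percentile {pct:.1f}")
```

Under the normality assumption the three gaps land within a point of the 16th, 25th, and 35th percentiles, respectively.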

Of course schools play a role in exacerbating these problems - no one said they don't - in particular because of the unequal distribution of teachers across schools. We can all acknowledge that this distribution of teachers is a partial legacy of contract rules - still in place in many districts - that gave preference to senior teachers. Both coalitions are concerned with attracting and retaining good teachers in hard to staff schools, and perhaps they can find some common ground there.

But it would be great if we grounded this discussion in some basic facts - facts that might include the current distribution of school effects, and how much of the achievement gap we could expect to see narrowed if we move a student from a below to an above average school (critical for the school choice question); how modest the effects of accountability have historically been on gaps (very little action at all on the black-white gap - Texas also comes to mind), and how more "vigorous accountability" will differ in ways that produce different outcomes; how much of the gap is a function of school-year versus summer learning; and how much of the gap is there when kids start school.

June 11, 2008

Why We Should Care About Test Score Inflation

Kevin Carey’s dismissal of “test score inflation” provides an ideal opportunity to talk about the book I finished this weekend, Measuring Up: What Educational Testing Really Tells Us, by Dan Koretz, a psychometrician at the Harvard Grad School of Education – hardly an opponent of testing.

Koretz calls “test score inflation,” in which gains on tests used for accountability dramatically outpace gains on low stakes tests, the “dirty secret of high-stakes testing.” If you compare NAEP trends and state score trends, you’ll see that state scores have increased significantly more than NAEP scores since NCLB was adopted.

To understand why test score inflation is a serious problem, you have to understand the sampling principle of testing. Koretz provides the following example: Suppose we want to evaluate students’ vocabulary. A typical high school student knows 11,000 root words, but a test can only include a sample of these words – maybe 40. If we design our test well, we can still learn something about the breadth of each student’s vocabulary. But we don’t really care if the student knows the 40 words on the test; rather, we care about the larger domain from which these words are sampled.

Now imagine that for weeks before our test, I drilled students incessantly on those 40 words. Voila! They perform exceptionally on the test. Yes, their vocabularies have increased by 40 words. Maybe these are 40 really important words - the so-called "test worth teaching to." But proficiency in the domain that my test is intended to measure has not expanded by the same amount. I’ve seen this over and over again; administrators and teachers figure out which concepts are consistently on the test, and which aren’t, and they alter their instruction accordingly. The trouble is that if we administer a slightly different test, drawing on a broader range of concepts from the domain we care about, kids haven't mastered them.

Carey explains that this is just a standards mismatch problem - i.e. state test standards are not the same as those used on national tests. Koretz takes Carey’s critique head on in this passage:

"Alignment is a lynchpin of policy in this era of standards-based testing. Tests should be aligned with standards, and instruction should be aligned with both....And alignment is seen by many as insurance against score inflation. For example, a principal of a local school that is well known for the high scores achieved by its largely poor and minority students gave a presentation to the Harvard Graduate School of Education a few years ago. At one point, she angrily denounced critics who worry about 'teaching to the test.' We had no reason to be concerned about teaching to the test in her school, she asserted, because the state’s test measures important knowledge and skills. Therefore, if her faculty teaches to the test, students will learn important things.

This is nonsense, and I have a hunch about what I would find if I were allowed to administer an alternative test to her students. Alignment is just reallocation by another name. Certainly it is better to focus instruction on material that someone deems valuable, rather than frittering time away on unimportant things. But that is not enough. Whether alignment inflates scores depends also on the importance of the material that is deemphasized. And research has shown that standards-based tests are not immune to this problem. These tests too are limited samples from larger domains, and therefore focusing too narrowly on the content of the specific test can inflate scores." (p. 253-254)

We only care about test scores if they translate into general improvements in children’s academic skills that generate meaningful improvements in their life chances. If these gains don’t translate to tests that measure similar skills – basic reading and math competencies - what are the chances that they are going to help them succeed in the workplace or in college? And that is a very good reason to worry about test score inflation.

Spoiler alert: NY state test scores are out next week, if not sooner. What should we make of NYC's flat NAEP scores alongside state test improvements so large they're unbelievable? Kind of makes you wonder.

June 10, 2008

Bold and Broad Brain Scan: It's Not an Either/Or, and No One Said It Was!

It didn't take long for the blogosphere to use its heralded mind reading abilities to accuse the Broader/Bolder campaign of advancing reforms traditionally outside of the K-12 system at the expense of K-12 reform.

Read the statement. It said no such thing. In fact, the report argues for continued school improvement efforts:

To close achievement gaps, we need smaller classes in early grades for disadvantaged children; to attract high-quality teachers in hard-to-staff schools; improve teacher and school leadership training; make college preparatory curriculum accessible to all; and pay special attention to recent immigrants.

June 5, 2008

A Texas Tall Tale Remembered, and Demolished, One More Time

In December 2000, the New York Times introduced us to the president elect's choice for Secretary of Education, a former football coach with a penchant for "snake-, lizard-, ostrich- or alligator-skin boots." In that article, Jacques Steinberg reported that under his leadership as the superintendent of the Houston Independent School District, Rod Paige "helped nudge test scores steadily upward in the Houston district, which is largely black and Hispanic. It now ranks among the highest-performing in the state." Houston, the commentators cooed, was nothing short of a miracle. In 2002, the district won the first Broad Prize for Urban Education.

By 2003, the press - and the Texas Education Agency - started looking more closely at Houston's results. In the Times' first article on the Houston miracle, "Questions on Data Cloud Luster of Houston Schools," Diana Schemo wrote, "Now, some here are questioning whether the miracle may have been smoke and mirrors, at least on the high school level. And they are suggesting that perhaps Houston is a model of how the focus on school accountability can sometimes go wrong, driving administrators to alter data or push students likely to mar a school's profile -- through poor attendance or low test scores -- out the back door."

Ten days later, the Times editorial page wrote that Paige "owes it to the country to share his thoughts on how this happened and what it means." In an interview with the Times editorial board a few days later, Paige defended his record. Gains in student achievement were real and "still standing," though he said ''there probably was'' a dropout problem.

But the cat was out of the bag. By December the Times had acquired test score data - both on the Texas TAAS and the nationally normed Stanford tests - and established that Houston's state test score gains were enormously inflated. In other words, Houston's sizable gains on the Texas test largely evaporated on the Stanford 9. In August 2004, 60 Minutes ran a segment on the Texas Miracle. When the Dallas Morning News uncovered widespread cheating in Houston late in 2004, it appeared that the game was finally over.

In this month's issue of Educational Evaluation and Policy Analysis, a new study by UT-Austin professor Julian Vasquez-Heilig and Linda Darling-Hammond, "Accountability Texas-Style: The Progress and Learning of Urban Minority Students in a High-Stakes Testing Context," revisits the Houston miracle by analyzing years of student-level test score and graduation data (1995-2002). There's no version up on the web yet, but here are some key findings:

* Score growth on the TAAS exam outpaced score growth on the Stanford exam. This appears to be prima facie evidence of test score inflation.

* Low-scoring students were excluded from taking the TAAS, both through special education and language exemptions and grade retention.

* A key strategy for improving test scores involved retaining students in 9th grade so they would not sit for the TAAS exit exam in 10th grade. At its peak, 30% of 9th graders were retained for one or more years. Some students were kept in 9th grade for two years, and then skipped to 11th grade so they could avoid the exit test. When more students were retained, unsurprisingly, accountability ratings went up.

* While minuscule dropout rates were reported, only a third of students were graduating in Houston in 5 years or less.

Take-home lesson? If it looks too good to be true, it probably is.

May 13, 2008

Roberta Flack, Vietnam, and NCLB - All in One Op-Ed

It's a deadly slow week in education policy, so I'll pass along this op-ed in the School Library Journal (Killing Me Softly: No Child Left Behind) on a teacher's decision to leave teaching because of the No Child Left Behind Act. Minus 5 points for the melodramatic beginning (I feel like the last marine who got out before the siege of Khe Sanh. I feel like the one Titanic band member who overslept, missed the voyage, and lived. In my darkest moments, I feel like a traitor.), but you can't hold that against a guy who writes young adult fiction. Here's an excerpt:

If you’re a teacher, thanks for being braver than I am. Thanks for riding it out when I’m just, well, riding out. And if you’re a parent, please fight for your child. Ask to see your school’s test-materials budget and its library budget. Ask to visit the classroom on a random day, unannounced. Ask whether your kid is getting more or less art than she would have had five years ago. Ask why band practice is at 7 a.m. when it used to be part of the school day. And while you’re mourning the loss of art, music, language, or history, ask the one most damning question of all: What took its place? If you get really riled up by the answer, please consider running for a spot on the school board.

As for me, I’m out. And I’m sorry.


Are teachers leaving because of NCLB? Does anyone have stories or data?

April 28, 2008

skoolboy says: Some of My Best Friends are Psychometricians, But...

Deborah Meier added a comment to the end of the value-added thread from last week. (Thanks for stopping by eduwonkette's blog, Deb!) Her point is too important to overlook. She writes that standardized tests of reading proficiency are only loosely correlated with good reading habits—i.e., that a student can score well on a test of reading proficiency without demonstrating the habits of mind that could enable him or her to engage in a critical discussion of a text. Meier also writes that we do not have tests that measure "the more significant intellectually sound habits of heart and mind fundamental to being a well-educated member of society. The capacity to confront a phenomenon of interest in ways that help one best understand it, and then to make use of the knowledge acquired, is surely more important than being able to guess the one out of four 'best answer.'"

She's absolutely right, in my view. Preparing children and youth to be citizens in a democracy is a critical purpose of schooling. eduwonkette has written that there's a lot to schooling that can't possibly be measured by standardized tests – I think my favorite line is from the title of a post in January riffing on New York City's "Thank a Teacher" nomination process, "They Never Say, 'Thanks for Improving My Test Scores!'" – but it's easy to fall into the trap of treating the current testing regime as the natural order of things.

We need to be mindful that public schooling is now what institutional analysts such as Pat Burch call an organizational field, with lots of actors influencing our definitions of schooling and its outcomes, including textbook publishers, testing firms, test-prep firms, and a variety of other commercial entities. Lots of commercial enterprises and non-profits owe their livelihood to public education, and are engaged in an ongoing project to shape our definitions of "real school."

Testing is big business in the U.S. Non-profits such as the Educational Testing Service and ACT have annual gross revenues approaching $900 million and $400 million, respectively. ETS's K-12 testing operation had gross revenues of $172 million in their 2006 IRS filings. On the for-profit side, Pearson Education had gross revenues worldwide of $4.6 billion in 2006, with $600 million in adjusted operating profit. Their annual report crowed of a "healthy outlook in school testing underpinned by 2005 contract wins with a lifetime value of $700m (including Texas, Virginia, Michigan and Minnesota)." McGraw-Hill Education had revenues of $2.7 billion in 2007, with operating profit of $400 million.

With this much money, and more, at stake, you can bet that there are ongoing projects to define tests and testing as the appropriate way of defining what counts as good education. They tap into a logic that defines the modern world as increasingly rational, and society as a collection of individuals with increasingly differentiated roles, identities and personal preferences.

I'm not sure what the right approach is to counter all of this. At one point, I thought that giving politicians, educators and parents vivid representations of good teaching and good learning (e.g., videos or portfolios) would be sufficient to persuade them that test scores don't come close to capturing what we aspire to in public education. But I haven't seen that strategy be successful. Preaching to the choir isn't going to do it – we need to find a way to put people in the pews. Readers, do you have ideas?

April 16, 2008

What Can Other Professions Teach Us about Evaluation and Accountability in Education?

In a very productive exchange, Dean Millot and Corey Bower have been contemplating the professional status of education. Dean's most recent post, "Why Legally Recognized Professionalism is Necessary to Reasonable Teacher Accountability," is one of the best think pieces I've read in some time. Read the whole thing, but here's the central theme of the post:

Lawyers and doctors are not punished for undesired outcomes; they are accountable for doing what professionals should do given their client’s circumstances....As a legally recognized profession, teacher conduct would be judged by teachers, according to standards of educational care devised by teachers, applied to the client circumstances in question.

Dean's post links well with AEFA conference talks by Randi Weingarten and Richard Rothstein last weekend. Weingarten also drew on the medical metaphor to argue that "teachers are physicians of the mind." In her view, there is a difference between the most skilled physician and a miracle worker. Just as the best hospitals can't solve public health crises on their own, Weingarten argued that, "schools cannot beat back all personal, social, and economic challenges that kids have." In an op-ed last week, she also endorsed a professional standard similar to that proposed by Dean:

[Teachers] should be assessed on how they use test scores and other data to adjust their teaching to help students improve....The approach is akin to judging doctors on how they use the results of blood tests, X-rays, and the like to prescribe a course of treatment.

In his talk, Rothstein drew on the experience of more fields than I can name (business, medicine, public works, etc). Despite many leaders' calls for education to mimic the private sector, Rothstein's review concluded that "private sector performance incentives rely primarily on subjective evaluations, not easily corrupted quantitative measurements." The central theme of the talk was that systems of measurement distort the processes they are intended to measure. The paper on which the talk is based - "Holding Accountability to Account: How Scholarship and Experience in Other Fields Inform Exploration of Performance Incentives in Education" - is a comparative/historical tour de force, and a must read if you're interested in the evaluation question.

Blog posts without positions generally fall flat on their faces, but I still have more questions than answers about Dean's proposal. Here are the two questions I'm pondering:

* How do the processes of diagnosis, inference, and treatment in education differ from those in medicine and law, and what are the implications of these differences for "professional accountability?"

* How does the state of our knowledge about educational diagnosis and treatment differ from that in other professions?

April 7, 2008

Has "A Nation at Risk" Done More Harm Than Good?

Richard Rothstein bats first in a lineup of essays at Cato Unbound commemorating the 25th anniversary of "A Nation At Risk," and asserts that the report has done more harm than good.

Why? First, Rothstein argues, the report wrongly concluded that student achievement was declining. The report mistook the changing composition of SAT test takers for a half standard deviation decline in SAT scores since the 1960s. Second, Risk placed the blame on schools for national economic problems over which schools have relatively little influence. While education surely plays a part in economic growth, he shows that our economic vicissitudes are driven by factors much larger and more complex. Third, he writes, Risk ignored the responsibility of the nation’s other social and economic institutions for learning. Rothstein concludes:

A Nation at Risk was well-intentioned, but based on flawed analyses, at least some of which should have been known to the Commission that authored it. The report burned into Americans’ consciousness a conviction that, evidence notwithstanding, our schools are failures, and a warped view of the relationship between schools and economic well-being. It distracted education policymakers from insisting that our political, economic, and social institutions also have a responsibility to prepare children to be ready to learn when they attend school.

I'm looking forward to this exchange, as I've never squared away in my mind whether A Nation At Risk was a report that spurred a movement, or a movement that engineered a legitimizing report. Michael Strong, Sol Stern, and Rick Hess will also weigh in this week.

April 3, 2008

Do High School Exit Exams Pay Off in the Labor Market?

High school exit exams have become a common fixture in American high school life. By 2006, 22 states had exit exams - and because larger states are more likely to have exams, approximately two-thirds of all high school students face exit exam requirements.

Proponents of exit exams often assert that these tests make the high school diploma more meaningful to employers. If this is the case, these policies should widen the gap in earnings and labor market outcomes between those who earn high school diplomas and those who don't. Despite the popularity of these policies, few papers have examined this claim empirically.

In "State High School Exit Examinations and Postsecondary Labor Market Outcomes," published in the most recent edition of Sociology of Education, Rob Warren, Eric Grodsky, and Jennifer Lee take up this question. Analyzing data from both the Census and the Current Population Survey, they found no evidence that state exit exams positively affect labor force status or earnings. Furthermore, they found no evidence that the effects of these policies vary by race or ethnicity, or by the level of difficulty of the exit exam.

In short, exit exams do nothing to increase the labor market value of the high school diploma. At the same time, other evidence suggests that exit exams (especially more difficult ones) are associated with lower public high school completion rates and higher rates of General Educational Development test taking (see Warren et al., High School Exit Examinations and State Level Completion and GED Rates, 1975-2002). Others find that exit exams increase inequality in rates of high school completion, and especially influence African-American students' odds of completing high school. (See Dee and Jacob, Do High School Exit Exams Influence Educational Attainment or Labor Market Performance?)

Of course, it is possible that exit exams help improve the quality of education in lower grades, though I've seen little evidence on this point. Readers, what do you think? Do exit exams hurt more than they help?

March 25, 2008

Got NAEP?

Great opportunity to ask National Center for Education Statistics Associate Commissioner Peggy Carr questions about the NAEP. At 2 p.m. on April 3, you can join her for an online StatChat about the 2007 writing assessment results. Submit questions for the chat anytime in advance here and pop in on the 3rd for the session.

March 22, 2008

Madame Secretary Demands Triage, Randy Reback Delivers

"We need triage," Madame Secretary explained last week. This morning, Randy Reback delivered it to my inbox via the Journal of Public Economics' new issue, which includes his paper, "Teaching to the Rating: School Accountability and the Distribution of Student Achievement." Reback analyzed data from Texas, the birthplace of NCLB-style accountability, and here's what he found:

* Schools respond to math performance incentives both by targeting math resources towards specific students and by making broad changes which also help very low achieving students. These responses tend to sacrifice the targeted students’ reading performance and to sacrifice relatively high achieving students’ performance in both math and reading.

* Schools respond to reading performance incentives by targeting resources towards the reading performance of particular students, sacrificing these students’ math performance and sacrificing all other students’ performance in reading.

* Finally, schools devote fewer resources towards students in the terminal grades during years when short-run incentives are low than during years when incentives are high.

Reback concluded:

Whether the finding of non-trivial distributional effects is a positive or negative outcome of this public policy is entirely subjective. If one of the primary goals is to create a sort of educational triage, in which students below minimum grade-level skills are pushed up, then the No Child Left Behind type of accountability system appears to be fairly effective. Furthermore, the results say nothing about the overall impact of this system on performance: it may be a rising tide that lifts all boats (and lifting some more than others), or it may be a falling tide sinking all boats (and sinking some less than others).

The important lesson here is that schools respond to the specific instructional incentives created by the accountability system. Schools' responses include targeting specific students, targeting specific subjects, and making broad changes which affect all students. An accountability system should only create disproportionate incentives concerning student achievement gains if the intention is to help some students more than others and to boost performance in some subjects by more than others. Otherwise, the optimal accountability system requires a more evenhanded approach.

March 20, 2008

Improving Graduation Rates: The Push Out/Pull In Dilemma

Today's NYT article on graduation rates touches briefly on the push out problem. But there's another approach to improving grad rates that has run rampant in NYC - awarding credit even after students fail courses. Seat time credit has received some play (see these old posts from Edwize and NYC Educator), but there's an important story waiting to be written about how schools have changed failing course grades if students attended tutoring or completed independent projects.

None of these tactics is necessarily problematic from an educational standpoint. In fact, offering multiple chances may be an important way to keep a reluctant and at-risk population attached to school. But they should challenge how we view changes in the graduation rate in NYC. It's also an awkward juxtaposition with test-based grade retention in grades 3, 5, 7, and now 8.

On the pushout issue, take a look at this recent paper by Linda McNeil and colleagues, "Avoidable Losses: High-Stakes Accountability and the Dropout Crisis." The quantitative part of the study doesn't do a good job of separating the portion of the dropout problem attributable to high-stakes accountability from that which predated accountability. Nonetheless, the qualitative section has some gems about the tradeoffs principals face when they are asked to increase test scores and graduation rates simultaneously. For example, one principal said:

It’s not a miracle to manipulate things. A miracle is saving kids actually, in reality—that’s what miracles are. To go out and get these kids who were dropped out, or to get kids who are not achieving and find ways. That’s a miracle to get all of it to do that. It’s not to manipulate things so that it appears—it’s a facade.

March 18, 2008

Before NCAA Divisions, We Need Better Data

Yesterday, Alexander Russo applied the concept of NCAA divisions to the comparison group debate. He suggested:

What about creating NCAA-like divisions (I, II, III) within public school systems based on student poverty, in order to help someone (educators) get past the poverty-achievement trap and help others (politicos) see that performance varies even with schools with similar demographics?

The trouble is that public schools only have access to blunt measures of students' socioeconomic status and other non-school conditions. In particular, free and reduced lunch eligibility poorly captures degrees of disadvantage. Imagine two schools in which 60% of students qualify for free lunch. In one school, free lunch qualifiers are from families making 95% of the poverty line; in the second school, these kids are from families earning 50% of the poverty line. With currently available data, we falsely make apples-to-apples comparisons between these schools. By the same token, a school full of poor graduate students' kids can look a lot like one with kids facing multigenerational poverty if we only consider free/reduced lunch measures.
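To make the arithmetic concrete, here is a minimal sketch of the two-school example above. The enrollment figure, the class and function names, and the one-point tolerance are all assumptions made for illustration; no district reports an income-to-poverty ratio for free lunch qualifiers, which is exactly the problem.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    enrollment: int                # assumed figure, for illustration only
    free_lunch_share: float        # fraction of students qualifying for free lunch
    qualifier_income_ratio: float  # qualifiers' family income / poverty line (not collected in practice)


def looks_comparable(a: School, b: School, tol: float = 0.01) -> bool:
    """Mimic a comparison based only on free lunch share (the blunt measure)."""
    return abs(a.free_lunch_share - b.free_lunch_share) <= tol


def poverty_depth_gap(a: School, b: School) -> float:
    """Difference in how far qualifying families sit from the poverty line."""
    return abs(a.qualifier_income_ratio - b.qualifier_income_ratio)


# The two hypothetical schools from the paragraph above.
school_1 = School("School 1", 500, 0.60, 0.95)
school_2 = School("School 2", 500, 0.60, 0.50)

if __name__ == "__main__":
    print("Same free lunch share:", looks_comparable(school_1, school_2))
    print("Gap in income-to-poverty ratio:", round(poverty_depth_gap(school_1, school_2), 2))
```

On the blunt measure the schools are a match, yet qualifying families in one school earn nearly twice what qualifying families in the other do.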

If we want to construct accurate comparison groups, we need to collect additional data on parental education, income, family structure, etc. A massive data collection effort isn't in the stars, though. So when we read sentences like, "School 1's share of students from low-income households is identical to that of School 2, so differences in test scores cannot be attributed to poverty," we should, at the very least, take a closer look. (See these related posts on the no excuses argument or NYC's peer groups.)

March 17, 2008

Charlie Barone and I Agree!

An event so rare that it deserves its own blog post: Charlie points to a Washington Post article on NCLB and students with disabilities. The article argues that NCLB has forced schools to focus on disabled students because their scores are separately disaggregated and only a small fraction of students can be exempted. Before NCLB, too many state accountability systems had gaping loopholes that allowed these students to be ignored (for more, see here).

Of course, this brings us back to the NCLB incentives debate. If we credit the structure of the law when students with disabilities receive more attention, shouldn't we look at the structure of the law when schools emphasize tested subjects? These are questions better answered by someone with a completed AERA paper...

February 29, 2008

Nip/Tuck for NYC Progress Reports?

Yesterday's Principals Weekly (a weekly email sent to New York City principals) foreshadowed some possible changes to the NYC Progress Reports. (You can read earlier posts on progress reports here.) Some proposed changes include:

1) The new system may assign separate grades for each element of the progress report. In other words, schools could get an A for the overall proficiency category, a C based on their students' test score growth, and an F based on the learning environment surveys. This is a very positive step. (Diane Ravitch made a powerful argument for this change in the fall.)

2) The Progress Reports compare each school to a group of similar schools. In the fall, the elementary and K-8 "peer indices" were created using demographics; the new proposal is to use "the average ELA and math proficiency rating of students in the testing grades" instead.

3) To address ceiling effects, the new Progress Reports may count any level 4 student (the highest performance level) who remains at level 4 as making one year of progress.

4) A "progress adjustment" may be made for special education students who take the state ELA and math tests in consecutive years. I am not sure how DOE plans to adjust scores, but this appears to be a response to Leo Casey's special ed post on Edwize.
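The ceiling-effect change in item 3 is easy to see with a toy calculation. This is only a sketch: I am assuming, for illustration, that "progress" means moving up at least one performance level, which is a simplification of mine and not the DOE's actual formula. The student data and function names are hypothetical.

```python
def made_progress(prior, current, credit_top_level=False, top_level=4):
    """Return True if a student counts as making a year of progress.

    Assumed rule (not the DOE's): progress means gaining at least one level.
    With credit_top_level=True, a student who stays at the top level also counts.
    """
    if credit_top_level and prior == top_level and current == top_level:
        return True
    return current > prior


def progress_rate(students, credit_top_level=False):
    """Share of students counted as making progress under the chosen rule."""
    made = sum(made_progress(p, c, credit_top_level) for p, c in students)
    return made / len(students)


# Hypothetical (prior level, current level) pairs for five students.
students = [(4, 4), (4, 4), (3, 4), (2, 2), (1, 2)]

if __name__ == "__main__":
    print("Without top-level credit:", progress_rate(students))
    print("With top-level credit:", progress_rate(students, credit_top_level=True))
```

Under the assumed rule, the two students who start and finish at level 4 can never register progress, so a school full of high scorers looks stagnant until the credit is applied.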

Read the full Principals Weekly excerpt on Progress Reports below, or see Elizabeth Green for more details.

Continue reading "Nip/Tuck for NYC Progress Reports?" »

February 8, 2008

Do Quality Reviews Lead to Increased Student Achievement?

skoolboy wraps up his posts on Quality Reviews. His first two posts can be found here and here.

Do quality reviews lead to increased student achievement? There’s been surprisingly little research that addresses this question. Most research on quality reviews has examined the school inspection process in Great Britain managed by the Office for Standards in Education (Ofsted), a national agency which reports to the Parliament. Since school inspections for primary and secondary schools were instituted in 1993, there have been several iterations in the school inspection process. But I haven’t found any persuasive evidence that inspections improve student achievement. Some teachers and administrators report that they intend to change their practices in response to the inspection report, but I’ve not seen studies which examine whether those intentions translate into improved practice.

You might get the impression from my postings this week that I think that quality reviews are a bad idea. Not necessarily! But there are some things that I think are essential for quality reviews to be a good idea. Here’s a brief list:

The purpose of the review must be clear. Sociologist Gary Natriello has written about four potential purposes for evaluations in schools: motivation, direction, certification and selection. The first two can contribute to school improvement, whereas the latter two are more concerned with regulation, accountability, and control; and it’s desirable to confront the tensions between improvement and control directly. If the purpose of a quality review is to improve how schools work, then all phases of the review process need to be oriented towards this purpose.

Definitions of quality must be clear and transparent. If there are clear criteria and standards for what constitutes school quality, then both educators and inspectors can orient their activities towards these criteria and standards. Unclear standards and definitions undermine the legitimacy of the quality review process. My impression is that the Ofsted criteria are a lot clearer than those that I’ve seen stateside. Quality teaching is a particularly challenging phenomenon to articulate; but if the goal is to improve teaching, we’ve got to be able to do it.

The quality review process must be designed to collect a sufficient amount of data on quality. If, for example, the purpose of the quality review is to improve teaching, then presumably there should be sustained collection of data on teaching quality, primarily through direct observation, but perhaps in other ways as well. Ms. Frizzle recently commented that in her New York City school, the quality reviewer was planning to observe 9 different classrooms in 30 minutes. Not much data on teaching quality will come from such a process. The intensity of data collection is a recurring challenge in evaluation research that involves site visits, because they are labor-intensive. “Drive-by” site-visits just aren’t very useful, even if conducted by well-trained observers, because they don’t gather enough data on the things that matter.

The frequency of quality reviews should be synchronized with a theory of how fast school quality is changing. This is Social Research 101: phenomena that change more quickly need to be measured more frequently to detect such changes, and phenomena that change more slowly don’t need to be measured as often. How frequently should we assess school quality? The school year is an arbitrary metric, and it may be wasteful and counterproductive to conduct school quality reviews on an annual basis. (In Great Britain, Ofsted inspects primary schools every three years.) Given a choice, I’d rather have less frequent, but more intensive, quality reviews.

February 4, 2008

Reviewing External Quality Reviews, or: Consultant Whack-a-Mole!

I teach at a college that periodically commissions external reviews of the institution and its academic programs. Sometimes these external institutional reviews are "high stakes," such as regional accreditation reviews (e.g., North Central Association, Middle States, etc.) or professional accreditation reviews (such as the National Council for the Accreditation of Teacher Education). Out of the corner of my eye, I've been seeing an increase in the reliance of large urban school districts, such as New York City and Washington, DC, on external reviews (sometimes labeled "quality reviews"). I'm intrigued by the similarities and differences I'm observing.

Most external reviews begin with a self-study, which typically has three major dimensions: (a) What are your unit's goals? (b) How well are you meeting these goals, and what's the evidence? (c) What are you going to do about it? This is then followed by the proverbial "site visit," in which an individual or team from outside of the institution reviews the self-study, comes to the campus for a day or two, pokes around and asks questions, and retreats to write a report which is shared with the institution and its leaders. Often, the institution then will write a response to the report. Then the report goes on the shelf.

The composition of the site visit team can arouse some passion. In postsecondary institutions, site visitors typically are conceived of as peers of the faculty; but who counts as a peer is a matter of debate. How can someone from Eastern Podunk College ever understand how we at Elite University do business? Is a site visitor who studies 18th-century English literature really a peer of the faculty in an English department that focuses on contemporary American fiction?

I'm intrigued by the fact that in New York City and Washington, DC, the site visitors are external management consultants who are not educators within the system, and in fact may not be teachers or administrators in other systems. Consultants such as these would be laughed out of the room in a review of a college department; but nobody's laughing in large urban districts. I think this is because college faculty are assumed to have stronger claims to disciplinary knowledge and expertise than do K-12 teachers and administrators, and because the shared governance model in colleges and universities gives faculty more control over academic decision-making than K-12 educators are typically granted.

Scholars of organizations make sense of external reviews by drawing on institutional theory. Institutional theory focuses on the relationship between organizations and their external environments, including the ways in which organizations are perceived to be legitimate by their external environments. An organization (e.g., school, district, or college) that is perceived to be high-performing generally doesn't have to worry about its legitimacy. But many educational organizations are not seen as high performers. In this case, they have to rely on some other way to be seen as legitimate than a demonstration of good outcomes. A common strategy is to imitate the practices of other social institutions that are seen as legitimate, in the hopes that the legitimacy will "rub off."

Many cases of education imitating the business world can be explained in this way. (Not that the business world has such a great track record to warrant serving as the ideal standard.) So, for example, because it's seen as rational for organizations to set goals and measure progress towards them, this is an integral part of most external review processes, much more so than direct inspection of what the organization is actually doing to meet those goals. This would account for the use of management consultants as external reviewers in New York City and Washington. In this sense, external reviews are mostly symbolic, rather than substantive.

This is, of course, a highly cynical view of external reviews, perhaps more than is warranted. I'd like to pose a couple of questions to eduwonkette's readers: (1) What are some legitimate purposes of external reviews of K-12 schools? (2) Based on these purposes, what should the composition of an external review team look like? The purpose in asking these questions is not to play whack-a-mole with consultants (although that may be a consequence), but rather to introduce a topic that I hope to post a bit more about over the next couple of days. I'm also curious if readers know of any evidence of external reviews actually improving teaching and learning in K-12 schools. Please feel free to e-mail me at skoolboy2 (at) gmail (dot) com to point me in a fruitful direction.
The opinions expressed in eduwonkette are strictly those of the author and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.
