
The Trouble with the Education Policy Advocacy Industry: "Building on the Basics"

Today, Marcus Winters, Jay Greene, and Julie Trivitt are releasing a study called, "Building on the Basics: The Impact of High-Stakes Testing on Student Proficiency in Low-Stakes Subjects."

It may be an elegantly executed study, or it may be a terrible study. The trouble is that based on the embargoed version released to the press, on which many a news article will appear today, it's impossible to tell. There is a technical appendix, but that wasn't provided up front to the press with the glossy embargoed study. Though the embargo has been lifted now and the report is publicly available, the technical appendix is not.

By the time some sucker with expertise in regression discontinuity combs through that appendix and finds a mistake, one that could alter the results of the study, the main findings will already have been widely disseminated and the news cycle will have moved on. Good luck interesting a reporter in that story. And even when researchers working in the policy advocacy industry make sloppy, indefensible errors - for example, when Greene and Winters used data that the Bureau of Labor Statistics warned against using to show that teachers are overpaid - they're not approached with caution by the press when the next report rolls around.

So as much as I like to kvetch about peer review and the pain and suffering it inflicts, it makes educational research better. It catches many problems and errors before studies go prime time, even if it doesn't always work perfectly.

As for the Winters, Greene, and Trivitt study, the jury is still out - as it should be until we have more information. I'll get back to you once I've read the technical appendix.


Let me first apologize to Jay Greene and my readers for shooting off a short post before teasing out all of the complexities around thinktanks, research, and the reporting of research in the popular media. I used Greene's paper as a vehicle for doing so, and that may have made it appear that I was criticizing the quality of that study when I was not in a position to do so. I shouldn’t have raised questions, even hypothetical ones, about the methods in that paper until the technical report was available for review, and you should definitely read Greene's response here.

This issue, however, is much larger than this particular Manhattan Institute report, and I want to use Greene's critique - that I have posted on working papers from the National Bureau of Economic Research (NBER) - to point out some important differences between papers issued by outlets like NBER and thinktank reports:

1) With NBER papers, everything is on the table upfront. They are scholarly papers that include extensive methods sections and robustness checks in every paper. Greene writes that, “If [reporters] requested the technical report, they could get that.” But the press release makes no mention of a technical report at all. The key difference is that there’s an extra step in the process to get to the detailed methods, which reporters writing articles could ostensibly circulate to other scholars for comment before writing an article.

2) There is no PR machine behind NBER papers. It’s one thing for me to write about a study on my blog. It’s entirely another to send press releases to reporters at newspapers and other media outlets, who in turn – and this is their fault, not Greene’s – cover his report like it’s a final product. The more complex the methods are, the more there is a need for peer review because it becomes more difficult to eyeball the problems from the sidelines – and Greene and his colleagues are using sophisticated methods in this report.

3) NBER papers generally aren’t trying to persuade anybody in particular of anything. They are not intended to sway public policy. Contrast this with the press release approach of policy advocacy thinktanks. For example, the press release for this study said, "In this report, Winters, Greene, and Trivitt dispel the myth that high stakes testing in reading and math will harm student proficiency in low-stakes subjects. The data from Florida provides further evidence for policy makers considering the renewal of No Child Left Behind, showing that national testing incentives improve overall educational achievement levels.”

4) NBER has implicit quality controls. It is a community of scholars to which one must be invited, one that has strong norms about how research is conducted and reported. The quality of the average NBER working paper is extremely high. There is much less variation in the quality of NBER papers than there is in thinktank reports. On some level, this is an issue of the trustworthiness of institutions; for example, I trust a report coming out of RAND or Mathematica more than I do one coming out of the Heritage Foundation, because neither RAND nor Mathematica has a stated ideological agenda.

For the best treatment of the thinktank issue I’ve ever seen, see this post by Dean Millot, and his preceding posts here and here.

This skepticism is a good thing.

But where was it, for example, a month ago when you (gleefully) reported the findings of Fordham's study on how high achievers have fared under No Child Left Behind?

That study seems to have the same potential infirmities as the current one, yet you reported that study's findings without today's warning.

I'm thinking the level of scrutiny shouldn't depend on whether a reporter agrees or disagrees with the study's findings.

Hi Ken,

Good question. The two studies are very different in their complexity. The Fordham study was descriptive and it was very clear what the authors had done with their data - they compared the 10th and 90th percentiles of NAEP scores from 2000-2007, while also supplementing analysis from this time frame by comparing states with and without accountability systems in the 1990s. It didn't claim causal effects, and the analysis of test score data was further bolstered by a survey of teachers, which provided insight into the mechanisms through which their observed effects might operate.

A regression discontinuity design like that used by Winters and Greene can only be used to make causal inferences when a number of assumptions are satisfied. In their report, we are told that the approach is like that of Rouse et al (2007) - an elegant paper using the same Florida data (which btw comes to a pro-accountability conclusion, conclusions I fully buy because of how well-done that paper is) - but it's not clear if Winters and Greene's approach is the same as Rouse et al's from the report. The empirical support for the mechanism that they suggest to explain their findings makes little sense without further detail.

In sum, unlike the Winters/Greene study, there are no technical assumptions required by the Fordham analysis, no case selection issues (the Rouse et al. and Winters/Greene papers seem to have different sample sizes, though it's not clear why), no need to worry about controlling for prior achievement (which Winters and Greene can't do) - it was an analysis that was very straightforward, and all of the relevant details were there in the main report.

Yes, e, I understand all that and agree.

But, did Fordham release that teacher survey data? I don't believe they did, so you have to take their word that their reporting of the data and the survey questions were done properly. And this is often the case with ed research. Rarely are datasets made available.

So, the underlying problem remains. That problem is that most journalists don't understand any of that stuff and will go ahead and report (and often spin) findings of these studies without first determining if the findings are sound. The initial stories get printed as truth and subsequently found infirmities are rarely reported.

I, too, agree that data should be made publicly available when possible. In the case of Fordham, they could have made the survey data available since there weren't privacy issues involved, as there are with the Florida data.

And you are also right that we need to put some of the blame for this phenomenon on journalists. Do you have thoughts about how they could deal with better reporting on research? We're never going to have people with experience doing this kind of work reporting - nor should we, as they are very different skill sets - so perhaps papers need to employ researchers on a consulting basis who can help reporters vet these studies. It would be an enormous public service if scholars would do this on a voluntary basis.

There are other issues with this study in addition to the good questions from Eduwonkette and her respondents. In particular, whenever we look at these high-stakes state tests, we have to ask what kinds of learning are being gained, what lost - not only in relation to subjects such as science, but to the ability to think. The media here tends to assume that test scores equal learning, and far too rarely asks what kinds of learning for what ends.

Assume Greene et al are correct, that the high stakes basic skills testing in reading and math induces a modest gain in test scores on the low-stakes science test in the year following a school obtaining an "F" grade from the state. The authors propose some explanations, but there are others.

For example, does the intense drill in answering multiple choice questions in reading and math that seems to almost inevitably accompany high stakes tests, esp failure on those tests, produce an increased ability to answer multiple-choice questions in science? Reading m-c questions and sorting answer options has been studied as a particular sort of reading skill, of questionable use beyond answering m-c items. But it could carry over to other subjects.

More generally, does an increase in the score indicate much in the way of increased knowledge or that the knowledge will be retained? Will there be future gains, or is this a one-shot experience (which of course can only be answered later)? Will it produce a better science foundation for future learning, or merely a few more fragments of learning that do not constitute a foundation? The test-critical literature the authors refer to, along with a great many more studies, tells us that the over-emphasis on testable so-called 'basic skills' in reading and math is not just crowding out teaching time in other subjects, it is crowding out time for thinking, problem solving, discussion, writing more extensive than the 'five paragraph' response to a prompt, etc.

In short, there is a seemingly endless circuitous 'logic' around the testing of basic skills that avoids deeper questions than whether a heavy focus on 'basic skills' (test prep?) reading and math can get students to answer an extra question or two on a mostly rote-memorization science test.

I think what journalists should be doing is finding someone knowledgeable and credible on the opposite side of the issue presented by a particular study and having that person provide specific criticism of the study, since that person has an incentive to do so. Then the study's authors can provide a rebuttal.

This is how our legal system works. The judge, or in this case the reader, can decide who made the best case.

The problem you run into, however, is when your go-to guy is someone like Fair Test's Monty Neill whose criticism of the study involves a bunch of meta-tangents rather than direct criticism. In order to properly rebut all his assertions you'd need to point out how each of his concerns is readily refuted by settled cognitive science data which, while amusing in and of itself, doesn't get us any closer to understanding the soundness of the underlying study. What we get is an endless re-argument of fairtest's talking points.

"Students in Florida were also administered a standardized exam in science, but this test was low-stakes because its results held no consequences under the A+ program or any other formal accountability policy."

A paper by Goldhaber and others conducted in FL found that the public opinion sanctions often outweigh the formal accountability mechanisms in place. So, I think one of their assumptions may be incorrect.

"There is some evidence to suggest that student science proficiency increased primarily because student learning in math and reading enabled that increase. That is, learning in math and reading appear to contribute to learning in science."

If this is true, how can individual teacher value-added be accurately measured without taking into account the effect of the other teachers on a student's achievement? In fact, wouldn't this finding support the notion of school-based rather than teacher-based rewards?

I think eduwonkette has touched upon an important issue, and that is, how research first reaches the public. The study in this case was embargoed until the day it was released, like any news story. What typically happens is that the authors write a press release that contains findings, and journalists write about the press release. Not many journalists have the technical skill to probe behind the press release and to seek access to technical data.
When research findings are released like news stories, it is impossible to find experts to react or offer "the other side," because other experts will not have seen the study and not have had an opportunity to review the data.

Diane Ravitch

I appreciate Eduwonkette's apology and was not offended by what she wrote. I think we are making progress in our discussion, but I still have some bones to pick, which are in this new post:

Thanks for the excellent rebuttal against Greene's "bone" picking. It was wholly a pleasure to read Dean Millot's piece on think tanks.

One of the key points is that think tanks are more focused on policy and ideological arguments. I think this is why Greene cries "Foul" and says one mustn't question motives.

Seems rather hypocritical coming from someone who writes many pieces questioning the motives of other agencies - "myths" as he so often prefers to call them.

Here are some questions/comments an anonymous academic reviewer might have. This is what a real peer review would look like. This is why academics feel the process is critical, policy wonks think it is a waste of time, and even good reporters could never be expected to do this themselves. At least two other professionals would write a similar review. A good editor would compile the results, especially noting shared criticisms and discarding trivial suggestions or simply wrong observations, determine which faults must be addressed if the paper is going to be published at all, and pass the reviews on to the researchers. This is usually an anonymous process.


This is a topic of timely and critical importance. The authors use an accepted statistical approach to analyze two years of FCAT data. No new methodology is proposed. New findings are presented. The authors consider the effects of an accountability system on achievement in two subjects within the accountability system (Reading and Math) and one subject not included in accountability ratings (Science) using the previous year as a control. The paper is generally well-written, but suffers from several significant weaknesses that strongly affect its validity. Numerous graphs and tables of available data that would greatly increase understanding of the study are omitted. The authors do not take a strict approach to the reporting of statistical significance or generalization from the null hypothesis.

1. This study does not actually compare gains on an FCAT Science test. It compares real Science performance on a grade 5 test with a predicted Science score based on Reading and Math test data. Reading and Math do correlate with Science, reflected in the final model reported (r-squared = .7371), but are they strong enough to estimate a precise measure for individual students? A graph of the real versus predicted scores (or residuals) for students using the observed 2005 data would be helpful to see the degree of error. A quarter (> 25%) of the variance in Science is not captured by the Reading/Math model. It could be random variance. It could be variance due to differences in the teaching of Science in response to the accountability model. If you are testing whether Reading and Math gains are independent of Science gains, using Reading and Math to predict one of the two Science scores in your comparison is a significant flaw in the design.

2. This study compares data from 2002 and 2003 tests and considers the impact of accountability "F" on performance in the 2003 year using 2002 as a control. However, several factors compromise this as an ideal time period for this study. First, as the authors state, a new version of the accountability ratings was suddenly implemented in the SUMMER of 2002 to the "surprise" of administrators. What new school reform could these administrators have reasonably been expected to implement to cause a change in achievement by the end of 2003 -- for any subject, let alone Science? The study does not include any evidence of any actual action taken by administrators. Second, the Science test that is the dependent variable in the 2003 data was being given for the first time. That makes the prediction of a Science test (see 1 above) using Reading and Math particularly problematic as you have to predict the performance of those students on a Science test before that test actually existed. So, the gains in Science were highly theoretical: simulated "proxy" gains on a test that not only was not administered the grade before, but did not even exist in the current grade the year before. And the gains are suggested to be in response to a new accountability rating system that surprised administrators did not have time to address in a meaningful way. This is a significant constraint on the validity of this study. The authors are to be commended for reporting it in the introduction, but they fail to address these timeline-related issues in the discussion of their results.

3. Why were more recent data not used? If the findings here are indeed reliable, don't 2004, 2005, 2006, and 2007 offer 4 possible opportunities for a straight replication? There is no need to modify any statistical program, just re-run the new data as is, assuming the data are still collected. Given the authors' goal to generalize to NCLB, it would be more appropriate to use FCAT data that were subject to the actual sanctions of NCLB when the law took effect.

4. The authors convert all tests in the study to z-scores (mean of 0, SD of 1). What was done to verify that the test scores were normally distributed? If they were not normally distributed, this conversion would distort the underlying data. If the tests were not each normally distributed, the authors should explain the degree to which this conversion biased their results. Graphs and frequency distributions of underlying scales and gains should be included in this report.

5. Many variables such as free lunch, race, and gender are used in the models, but the coefficients are not reported. Why? If the authors consider them important enough to be included in the models, the results should be reported. If they had no effect in models, they should not be included in the models.

6. Given the controversy over the word "effect" in educational studies, calling the corresponding rise in one content area score along with another a "correlation effect" should be reconsidered. Merging "correlation" and "effect" into a single phrase for this purpose is going to lead to endless confusion and criticism.

7. The authors need to be consistent in identifying effects and describing them. Three levels of statistical significance are used, including the less strict p

8. The authors need to take great care not to generalize from lack of statistical significance. Failure to reject the null is taken as evidence of a fact. This is a problem throughout the paper. Although the overall sample size (the statewide population, really) is large, the lack of statistical significance can be affected by the distribution of the variable in question. For example, in the table considering the effects of Math, the coefficient for a C-rating was .022 and was statistically significant at p

9. Why are there 151,604 students reported in the model for Science gains (observed - simulated), but only 150,458 students reported in the model for the observed data? How could there be fewer students with real scores than students with real and proxy scores? The authors should explain this in a footnote in the table.

10. The authors comment that the Florida data are better than Chicago data as there have been no claims concerning systematic manipulation of results. The authors should take care to mention the statewide investigation of 3rd grade FCAT data initiated in 2007 to see whether strong Reading gains posted by 3rd graders in 2006 were valid. That happened after these data were collected, but the authors may want to steer clear of blanket claims as to the undisputed integrity of any testing program, unless they are willing to stake their professional reputations on it.
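
On point 1 above, the arithmetic behind "a quarter of the variance" can be sketched as follows. This is only an illustration built on the reported r-squared of .7371; the function names are hypothetical, and the assumption that the Science outcome is on a z-score scale (SD of 1) comes from point 4, not from the model output itself.

```python
import math

def unexplained_share(r_squared):
    """Share of outcome variance the model leaves unexplained."""
    return 1.0 - r_squared

def residual_sd(r_squared, outcome_sd=1.0):
    """Typical size of a prediction error, in outcome units.

    Assumes an in-sample least-squares fit, where residual variance
    equals (1 - R^2) times the variance of the outcome.
    """
    return outcome_sd * math.sqrt(1.0 - r_squared)

# Reported R^2 for the Reading/Math model predicting Science
r2 = 0.7371
print(round(unexplained_share(r2), 4))  # 0.2629
print(round(residual_sd(r2), 3))        # 0.513, assuming a z-scored outcome
```

Under those assumptions, a predicted Science score for an individual student would typically miss by about half a standard deviation, which is the scale against which any reported gain should be read.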

Rather than address many of the concerns brought forth by Eduwonkette or by Anonymous Peer Review's excellent post-mortem of Greene's report, Greene has chosen to set up a paper tiger and attack it.

In other words, Greene is questioning Eduwonkette's right to criticize his work rather than address the multiple problems and issues brought out by several contributors. This seems to indicate that Greene is evading a substantial discussion of his paper's perceived weaknesses.

Validity and credibility are important to research, which makes a strong case for the importance of peer review. It seems contradictory for Greene to proclaim a desire for higher standards for public education while appealing to lower standards for his own work.

I've got a technical question about studies like this...

If you look at schools labeled "Failing," you're taking a snapshot of a somewhat noisy process (assessing schools on the basis of a test) where some schools have good years and some schools have bad years, and the schools having bad years (in the random sense) will be at least slightly over-represented in the "Failing" group (just as ball players having a bad year will be over-represented in the group of players with batting averages under .200).

So if you look at their results the next year, you expect more positive movement than negative movement, just from the noise. Does the statistical analysis in the study account for this?

I don't know enough about social science statistical methods to know what's built into the modeling, and what isn't.

Hi Rachel, This problem is formally referred to as "regression to the mean," and studies like this one almost always perform a series of checks to be sure that this is not driving their results because it certainly can. I still haven't had the chance to read the technical appendix or watch the video of the AEI event where this work was presented this week, so I am not sure if this study does check for regression to the mean.
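
The effect Rachel describes can be sketched with a small simulation. This is a toy illustration, not the study's method or data: the number of schools, the noise level, and the 10 percent "Failing" cutoff are all made-up assumptions, and no school receives any treatment at all.

```python
import random
import statistics

def simulate_failing_gain(n_schools=5000, noise_sd=0.5, fail_share=0.10, seed=0):
    """Average year-over-year change for 'Failing' schools versus the rest.

    Each hypothetical school has a fixed true quality; each year's observed
    score is that quality plus independent noise. Nothing changes between
    years, so any average 'gain' comes from noise alone.
    """
    rng = random.Random(seed)
    true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_schools)]
    year1 = [q + rng.gauss(0.0, noise_sd) for q in true_quality]
    year2 = [q + rng.gauss(0.0, noise_sd) for q in true_quality]

    # Label the lowest-scoring share of schools in year 1 as "Failing"
    cutoff = sorted(year1)[int(n_schools * fail_share)]
    failing = [i for i in range(n_schools) if year1[i] < cutoff]
    others = [i for i in range(n_schools) if year1[i] >= cutoff]

    gain_failing = statistics.mean(year2[i] - year1[i] for i in failing)
    gain_others = statistics.mean(year2[i] - year1[i] for i in others)
    return gain_failing, gain_others

gain_failing, gain_others = simulate_failing_gain()
print(f"Failing schools, mean change: {gain_failing:+.2f}")
print(f"Other schools, mean change:   {gain_others:+.2f}")
```

In this setup the "Failing" group improves on average even though nothing was done to it, which is why a check for regression to the mean matters before crediting a gain to the accountability label.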
