
A Fresh Start for 'What Works'?


In a recent Education Week commentary, Robert E. Slavin writes that the 5-year-old What Works Clearinghouse, designed to provide educators with a central and trusted source of evidence on what works in education reform, has failed to produce information that is scientifically justified or useful.

But the clearinghouse may get a fresh start now that Mathematica Policy Research Inc. has won the contract to operate the federal endeavor, which is the flagship of the Institute of Education Sciences, argues Mr. Slavin, a co-director of the Center for Research on the Education of Students Placed at Risk at Johns Hopkins University in Baltimore.

What do you think? How can the What Works Clearinghouse provide practitioners with more useful information about education research?


Slavin is right. The solution to making the Clearinghouse useful is to establish clear and rigorous standards for what can be deemed effective. So long as programs can be accepted on the basis of evaluation designs with small samples, untested outcome measures, and the like, no one will have much incentive to do better.

Mr. Racine's comment is well taken. However, in following What Works for five years, it appears to me that one problem is the dearth of solid research. For every small study that is accepted, there are many more that do not meet the standards of rigor (experimental or quasi-experimental design). They have responded to what the field seems to want by boiling everything down to green- and yellow-light indicators--but a wise administrator should really look at the studies selected before investing in a strategy.

There does seem to be an emphasis (presumably based on what is submitted to them) on "canned" programs as opposed to identifying useful strategies.

It seems that the Dept. of Ed.'s "Doing What Works" site is a step in the right direction with regard to pertinence and practicality for educators. This is assuming, of course, that it is based on sound data. Has anyone looked at its analysis of teaching literacy to K-5 English Language Learners?

Seems to me that this represents a $30 million boondoggle. Whoever had responsibility for this waste of taxpayers' money should be publicly denounced.

The problem isn't with the rigor; it's that most of the programs the WWC evaluated couldn't measure up to the rigor that was already in place! I question the timing around the birth of the What Works Clearinghouse and "who" at the Dept. of Ed. was responsible. I am curious to know how many were involved with Spellings and the same small group of people who gave us Reading First. Did they believe the clearinghouse could further their dirty work and drive the final nail for Direct Instruction? Once the clearinghouse reported that NONE of the big publishing companies had any programs that met the rigor of scientific evidence, and worse yet, that Reading Recovery was the only early reading program to receive positive ratings in all four domains...all of a sudden WWC was a bad idea?? Interesting how Mr. Slavin didn't mention anything about the Success for All ratings from the WWC. If you don't like the findings, find somebody else to do it? Hardly a great idea for a "fresh" start.

The Alphabetics domain was one of four in which programs have been evaluated; Fluency, Comprehension, and General Reading Achievement were the other three. The programs evaluated that did meet the criteria for rigor (scientific evidence) almost all got at least potentially positive ratings for effects on achievement in the Alphabetics domain. That speaks more about the domain than about a failure of the WWC. The WWC reports that I read provided a great deal of caution to the reader if the studies were small. And if there was only one study that met the rigor and demonstrated a positive effect, the Clearinghouse still did not give higher than a "potentially positive" rating. Go to the website and look across all four domains. Note how few programs could demonstrate even potentially positive effects in comprehension and general reading achievement. Ponder how a program might have received a positive rating for fluency while the same program demonstrated a potentially negative rating for comprehension. It's not the WWC we should blame...they are only the bearers of the bad news. Mr. Slavin is picking the wrong battle. It's a crime what these programs are doing to our readers in this nation.

I still think of "Educator" as a four-letter word and "Teacher" as the highest of professions (I'm not one, I found). NCLB and my state of Ohio demand that teachers use "...scientifically based...best practices...", and so did my principal. I searched the web a couple of years back and found few or none for H.S. Math. Then I thought I had struck gold: a site devoted to giving me just what I needed (Doing What Works)! Then I discovered the site had no useful strategies and focused on expensive "canned" programs or textbooks. Fool's gold--and still none for H.S. Looking at the "research," I found claims of "statistical significance" that were barely so, if at all, and designs so poor as to make the work worthless (in my opinion). I want to see a doubling of success (or more) over a broad range of schools and students before I would use a new method--much less invest in one. There are too many uncontrollable variables in education research. It was obvious to me that the site was being used by program developers to push their pricey products. I wrote it off as another boondoggle. ($30 million? You've got to be kidding!) When are we going to find out how successes are really achieved, and then spread the word? So far the only thing I see working is exceptional, devoted teachers working their _ _ _ off at the expense of their families, friends, and $. I find most of the high-profile school and class examples either demonstrate this or have hidden reasons why they work (like kicking out misbehaving students, having a militia of "monitors," forcibly including parents, etc.). WWC--a great concept, a great goal, a lousy performance. Although Mr. Slavin ("...the chairman of the Success for All Foundation, a private nonprofit company that provides a widely used K-12 reading program") is correct, I too fear his motivation for writing may be suspect.

What Works doesn't work primarily because it does not know what works. Dedicated teaching works, with or without a textbook or program from the "experts" who are "researching" what works. Instead of diverting precious funding to politicized "think tanks," the funding should go to schools and school districts where the real, on-site research is being done every minute of every day by the real researchers: the teachers. We should move beyond what is statistically significant and study and use what is really significant.

I think Dr. Slavin should study the What Works criteria more closely. There is an overall rating of the amount of evidence available for a program based on its size (e.g., number of participants and schools). Additionally, Dr. Slavin's point about the length of programs is not relevant. Early literacy (pre-reading) programs often focus on skills that can be enhanced with relatively short programs.

I am sure that Dr. Slavin’s comments concerning the What Works Clearinghouse (WWC) will be interesting to many readers of Ed Week--for some because they dislike the idea of evidence-based practice, for others because they will disagree with his point of view, and for some because they appreciate meaningful debate within science. Of course, to be completely fair and consistent, Dr. Slavin perhaps should have revealed his implicit conflict of interest in his dismissal of WWC reviews and his “suggestions” for a different set of standards.

Dr. Slavin’s own commercial product, Success for All, was reviewed by the WWC and reported to have “potentially positive effects” in both “alphabetics” and “general reading achievement,” “mixed effects” in “comprehension,” and no measured outcomes for “reading fluency.” This rating was based on seven studies: one randomized controlled study and six quasi-experiments in which schools self-selected into the Success for All group. Sixty-seven other reports (~91% of identified reports) that were purported to evaluate the impact of Success for All were excluded because of methodological or statistical features that, following WWC standards for interpretable evidence, could not support a causal interpretation. Many of these excluded reports had the design features that Dr. Slavin identifies as the standard by which studies should be included as providing the evidence base of programs. That is, they included large numbers of students who were exposed to the Success for All program in their schools for periods of one, two, or three years and typically completed commonly used standardized tests as outcome measures. However, from the WWC perspective (and the perspective of research methodologists and statisticians), the problem with these excluded studies is that they (a) failed to establish the equivalence of the groups compared before the Success for All program was initiated (thereby making it impossible to rule out the possibility that observed differences at the end of the study were due to pre-existing differences or other alternative factors), (b) included confounding conditions whereby the effects of the program could not be disentangled from the effects of other factors (e.g., teachers, schools, classrooms), (c) employed statistical techniques that violated the assumptions underlying the statistical test, which typically inflates the probability of finding statistically significant effects, or (d) suffered from some combination of these issues.

Dr. Slavin raises a number of potentially important issues associated with the existing WWC Reports. Ratings should not be based on measures that are over-aligned with the program being tested. Comparisons among conditions should not involve an “apples and oranges” approach in which conditions vary on multiple dimensions. However, a reading of the reports and technical appendices concerning Saxon Math for Middle School Math and DaisyQuest for Beginning Reading raises questions about what it was that Dr. Slavin was reading when he concluded that the WWC procedures “… ignore design elements with far more potential for bias than lack of random assignment” or that they represent an “egregious example” of misleading reporting.

In the case of the Saxon Math Report, the rating of effectiveness was based on six studies with averaged effect sizes of .65 from 46 students, -.08 from 32 students, .41 from 78 students, .19 from ~3,000 students, -.13 from 185 students, and a nonsignificant and incalculable effect size from 28 schools with an unknown number of students (although the mean differences favored students using Saxon). The overall average effect size across all studies was .21. Of the eight separate statistical tests conducted across the six studies, half were statistically significant, and all of those favored the Saxon Math group. Only two of the eight comparisons favored the comparison group over the Saxon Math group, and only one of these was large enough to be considered substantively important according to WWC standards. The study with the largest number of students found statistically significant effects on two standardized measures that were a part of a state’s achievement test. Although one might reasonably quibble about whether the one randomized study without randomization problems that used a nonstandardized measure should have qualified Saxon Math for a “positive” rating of effectiveness rather than a “potentially positive” rating of effectiveness, the bulk of the evidence indicates that students taught using Saxon Math did as well or better than students taught with alternative math curricula on a variety of math outcome measures, most of which were nationally standardized or state assessments.
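For readers who want to check the arithmetic, here is a minimal sketch of how the overall figure follows from the study-level numbers listed above. The simple unweighted mean is my assumption about how the average was computed, and the 0.25 effect-size cutoff for a "substantively important" effect is my reading of the usual WWC benchmark; neither is quoted from the Saxon Math report itself.

    # Sketch (assumptions noted above): averaging the Saxon Math study-level effect sizes.
    effects = [0.65, -0.08, 0.41, 0.19, -0.13]   # the five studies with calculable effect sizes
    samples = [46, 32, 78, 3000, 185]            # approximate student counts for those studies

    unweighted = sum(effects) / len(effects)
    weighted = sum(e * n for e, n in zip(effects, samples)) / sum(samples)

    print(f"unweighted mean effect size: {unweighted:.2f}")  # ~0.21, matching the figure above
    print(f"sample-weighted mean:        {weighted:.2f}")    # pulled toward the ~3,000-student study
    print([e for e in effects if abs(e) >= 0.25])            # study-level averages clearing the 0.25 bar

Either way of averaging, the overall estimate is modest but positive, which is consistent with the bulk-of-evidence reading above.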

In the case of the DaisyQuest Report, the rating of effectiveness was based on four randomized controlled studies that included six comparison groups with average effect sizes for measures of phonological awareness of .66 from 49 students, .73 from 49 students, .90 from 27 students, .89 from 69 students, .85 from 69 students, and -.46 from 69 students. For measures of reading, the average effect sizes were .39 from 49 students and .31 from 49 students. Of the 18 separate statistical tests on phonological awareness outcomes across the four studies, 10 were statistically significant, all but one favored the DaisyQuest group, and of those that favored the DaisyQuest group, 14 were large enough to be considered substantively important according to WWC standards. Of the eight separate statistical tests on reading outcomes included in one study (two comparison groups), none were statistically significant, all but one favored the DaisyQuest group, and six of these were large enough to be considered substantively important according to WWC standards. Given that DaisyQuest is intended to promote the development of phonological awareness, it is unclear why the measures of phonological awareness used in the studies--the majority of which were versions of what are now commonly used and nationally standardized tests--were the wrong measures to assess outcomes. Moreover, 75% of measures of reading showed effects that far exceeded the standard Dr. Slavin defined as “educationally meaningful.” Additionally, that a single comparison of a teacher-directed phonological awareness program outperformed the DaisyQuest program does not mean that the DaisyQuest program was ineffective. All children in five of the six comparisons were receiving some sort of reading instruction (i.e., they were kindergarten age or older). Students exposed to DaisyQuest outperformed students exposed to this regular reading instruction, as did students exposed to the teacher-directed instruction. Based on this evidence, both instructional programs worked.

Dr. Slavin also takes exception to the fact that some standards differ from review topic to review topic. However, this concern seems to ignore the fact that different programs are intended for different uses. It seems reasonable to evaluate a program within the timeframe for which it is commonly used or intended. Consequently, it makes sense to evaluate curricula over the course of a semester or a school year rather than five hours of use. In contrast, a program like DaisyQuest is not intended as a curriculum, and consistent with the reported results of the National Reading Panel, about five hours of supplemental instruction in phonological awareness approaches the asymptotic benefit.

Finally, Dr. Slavin appears to confuse sample size and interpretability. He suggests that studies with larger numbers of students are less likely to be biased. Of course, this is completely false. Design elements reduce bias, not sample size. A poorly constructed but large quasi-experimental study in which schools, teachers, or students self-select into the different groups is far more likely to produce a biased effect estimate than is a small, well-designed randomized study. This bias is compounded when the wrong statistical procedures are used (e.g., ignoring clustering). Perhaps Dr. Slavin is actually concerned about generalizability. However, here again, a large sample does not by itself enhance generalization. Moreover, without an unbiased and causally interpretable estimate of an effect, one has to wonder what exactly it is that could be generalized anyway.
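A toy simulation illustrates the point. Everything in it is invented for illustration (the self-selection mechanism, the school and student counts, the zero true effect); it is not modeled on any particular WWC study.

    import numpy as np

    rng = np.random.default_rng(0)

    def large_quasi_experiment(n_schools=100, n_per_school=50):
        """Large quasi-experiment: higher-achieving schools self-select into a
        program whose true effect is zero."""
        school_quality = rng.normal(0, 1, n_schools)
        # Adoption probability rises with baseline school quality (self-selection).
        adopts = rng.random(n_schools) < 1 / (1 + np.exp(-2 * school_quality))
        scores = np.concatenate([rng.normal(q, 1, n_per_school) for q in school_quality])
        in_program = np.repeat(adopts, n_per_school)
        return scores[in_program].mean() - scores[~in_program].mean()

    def small_randomized_study(n_students=60):
        """Small randomized study of the same zero-effect program."""
        scores = rng.normal(0, 1, n_students)
        treated = rng.random(n_students) < 0.5
        return scores[treated].mean() - scores[~treated].mean()

    quasi = [large_quasi_experiment() for _ in range(500)]
    rct = [small_randomized_study() for _ in range(500)]
    print(f"large self-selected quasi-experiment: average estimated effect {np.mean(quasi):+.2f}")  # biased well above 0
    print(f"small randomized study:               average estimated effect {np.mean(rct):+.2f}")    # centered near 0

With fifty times as many students, the self-selected design consistently "finds" a large effect of a program that does nothing, while the small randomized study stays centered near zero. Sample size buys precision, not protection from this kind of bias.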

Whereas there are undoubtedly areas in which the WWC can be enhanced, the way to that better WWC does not involve giving up standards that yield unbiased and causally interpretable outcomes, as Dr. Slavin suggests. In many ways, the problem is not with the standards used by the WWC but with the standards employed by researchers conducting studies. These researchers and the journal editors who publish the results of uninterpretable studies should heed the call for a change in standards. Right now, however, the education field will simply have to recognize that there are far more programs available than there are programs with compelling evidence of effectiveness.

Dr. Slavin should more carefully study the reports on intervention programs. I looked over those of Stepping Stones to Literacy and found that Slavin misrepresented the evidence (studies) for this intervention. Slavin indicated that the evidence was based on 36 students. This represents the sample for one of two studies. The total number of students studied across two randomized field trials was 120. It appears that Dr. Slavin is distorting the facts to make a point.

Comments are now closed for this post.

