Readers are often, and understandably, frustrated by reports of educational experiments and their effect sizes. Let's say a given program had an effect size of +0.30 (that is, 30% of a standard deviation). Is that large? Small? Is the program worth doing or worth forgetting? There is no simple answer, because the answer depends on the quality of the study.
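For readers who want to see what "30% of a standard deviation" means in arithmetic terms, here is a minimal sketch. It assumes the common pooled-standard-deviation form of the standardized mean difference (often called Cohen's d); individual studies may compute effect sizes differently. The function name `effect_size` and the scores are hypothetical, invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(treatment, control):
    """Standardized mean difference: (mean_t - mean_c) / pooled SD.

    Assumes the pooled-SD convention; this is one common choice,
    not necessarily the one used in any particular study.
    """
    n_t, n_c = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical scores, made up for this example only.
control = [40, 50, 60]    # mean 50, standard deviation 10
treatment = [43, 53, 63]  # mean 53, standard deviation 10

# A 3-point gain on a test with a standard deviation of 10 points.
print(round(effect_size(treatment, control), 2))  # 0.3
```

In this toy case the treatment group scores 3 points higher on a measure whose standard deviation is 10 points, which is what an effect size of +0.30 expresses.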
The Institute of Education Sciences recently issued a report by Mark Lipsey and colleagues focusing on how to interpret effect sizes. It is very useful for researchers, as intended, but still not so useful for readers of research. I wanted to point out some key conclusions of this report and a few additional thoughts.
First, the Lipsey et al. report dismisses the old Cohen characterization of effect sizes of +0.20 as "small," +0.50 as "moderate," and +0.80 as "large." Those numbers applied to small, highly controlled lab studies. In real-life educational experiments with broad measures of achievement and random assignment to treatments, effect sizes as large as +0.50, let alone +0.80, are hardly ever seen, except on occasion in studies of one-to-one tutoring.
The larger issue is that studies vary in quality, and many features of studies give hugely inflated estimates of effect sizes. In order of likely importance, here are some factors to watch for:
- Use of measures made by the researchers
- Very brief studies (often one hour or less)
- Studies with small sample sizes
Studies that incorporate any of these elements can easily produce effect sizes of +1.00 or more. Such studies should be disregarded by readers serious about knowing what works in real classrooms and what does not.
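To make the small-sample point concrete, here is a rough simulation sketch. It rests on assumptions of my own, not on anything in the Lipsey et al. report: the true effect is set to zero, scores are drawn from a normal distribution, and the names `simulate_null_effect_sizes` and `share_at_least` are hypothetical. It shows only one mechanism, sampling noise, by which tiny studies can throw up large effect sizes by chance.

```python
import random
from math import sqrt
from statistics import mean, stdev, variance


def effect_size(treatment, control):
    """Standardized mean difference using a pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def simulate_null_effect_sizes(n_per_group, n_studies=200, seed=0):
    """Simulate studies of a program with NO true effect (an assumption)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(n_studies):
        # Both groups come from the same distribution: true effect is zero.
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        results.append(effect_size(treatment, control))
    return results


def share_at_least(effect_sizes, threshold):
    """Fraction of studies whose effect size is at least `threshold` in size."""
    hits = sum(1 for d in effect_sizes if abs(d) >= threshold)
    return hits / len(effect_sizes)


small = simulate_null_effect_sizes(n_per_group=10)
large = simulate_null_effect_sizes(n_per_group=500)

print("spread with 10 per group: ", round(stdev(small), 2))
print("spread with 500 per group:", round(stdev(large), 2))
print("share of tiny studies at or beyond 0.50: ", share_at_least(small, 0.5))
print("share of large studies at or beyond 0.50:", share_at_least(large, 0.5))
```

Even though the simulated program does nothing, the tiny studies scatter widely around zero while the large ones cluster tightly. If only the impressive-looking results get reported, small studies will look far better than they should.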
The Lipsey et al. review notes that in randomized studies using "broad measures" (such as state test scores or standardized measures), effect sizes across elementary and middle school studies averaged only +0.08. Across all types of measures, average effect sizes were +0.40 for one-to-one tutoring, +0.26 for small-group interventions, +0.18 for whole-classroom treatments, and +0.10 for whole-school treatments. Perhaps we should call effect sizes from high-quality studies equivalent to those of one-to-one tutoring "high," and then work backward from there.
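One way to "work backward" is to compare a reported effect size with the average for its type of intervention rather than with a single fixed scale. The sketch below uses only the four averages quoted above; the dictionary `BENCHMARKS`, the function `relative_to_benchmark`, and the idea of expressing the comparison as a ratio are my own illustrative assumptions, not a procedure from the report.

```python
# Average effect sizes by intervention type, as quoted in the text above.
BENCHMARKS = {
    "one-to-one tutoring": 0.40,
    "small-group": 0.26,
    "whole-classroom": 0.18,
    "whole-school": 0.10,
}


def relative_to_benchmark(effect_size, intervention_type):
    """Return the effect size as a multiple of the average for its type.

    The ratio is an illustrative convention (an assumption), not a
    standard statistic.
    """
    return effect_size / BENCHMARKS[intervention_type]


# The same +0.30 looks very different depending on the kind of program.
print(round(relative_to_benchmark(0.30, "whole-school"), 2))         # 3.0
print(round(relative_to_benchmark(0.30, "one-to-one tutoring"), 2))  # 0.75
```

An effect size of +0.30 is about three times the typical whole-school result but only three quarters of the typical tutoring result, which is why a single "large or small" label cannot serve both.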
The real point of the Lipsey et al. report is that the quality and nature of studies have to be taken into account in interpreting effect sizes. I wish it were simpler, but it is impossible to be simple without being misleading. A good start would be to stop paying attention to outcomes on researcher-made measures and to very small or brief studies, so that effect sizes can at least be understood as representing outcomes educators care about.