Education Research: Reflections on the 'Gold Standard'

In my last blog, I pointed out that, by focusing on what policies and practices work, rather than what designs for education systems work, American researchers are actually contributing to the sub-optimization of those policies and practices.  If the system in which those policies and practices are embedded is dysfunctional, then research that optimizes particular policies and practices for use in that system will identify policies and practices that are highly unlikely to be the best policies and practices in a system that is functioning well.

That is, I believe, true in general.  But it is true in particular for the most highly regarded of research methods, the "Gold Standard" of education research, the Randomized Controlled Trial (RCT).  This method consists of selecting at random enough people from the population of interest to get statistically sound results and assigning them to receive a particular treatment, then identifying a random sample of the same size from the same population that will not get the treatment, and comparing the results of the two groups.  This research methodology is said to be the "Gold Standard" because it is thought to enable the researcher to make a very strong argument that the treatment caused the observed results, if the group that got the treatment or program performs very differently from the group that did not.

The elevation of RCT to Gold Standard status was an important achievement.  It was developed and promoted to counter the perennial—and all too often well-deserved—critique of much education research that it was not research at all, but rather ideological posturing dressed up as research.  Its proponents wanted to elevate the standing of a method that could leave little doubt about claims of causality.  That was a worthy goal.

However, in my last blog, I argued that the most important single obstacle to the improvement of the performance of the American education system is the overall design of the system itself, rather than the design of particular elements within that overall design.  To illustrate, with a greatly oversimplified example, I will claim that a system that is designed to make the most efficient use of an endless supply of cheap, poorly educated teachers working in schools, the dominant organizational form in the industrial America of the 1910s and 1920s, is no match for the systems top-performing countries are now using, which are designed to recruit their teachers from the top levels of their college students, educate and train them well, and provide them with a modern professional workplace, with professional careers, professional compensation and advancement systems, and professional management.

You will note that I just made an assertion of fact, which includes an attribution of causation.  In effect, I said that the collection of policies being pursued by the countries with highly effective education systems could be described in way X and, further, that those policies, taken not just as a collection of independent policies, but as a coherent policy system, cause the superior performance of those systems.  OK, you might say: Prove it!

And so I rush out to prove it with the most persuasive evidence I can muster, an RCT.  But I have a problem.  I cannot randomly assign national populations to national education systems.  The idea of dividing into two groups many thousands of the children of England, let's say, chosen at random, and sending one group to experience the education system of the Czech Republic, while the others languish in England, is laughable.  It could not be done.

But if I could get beyond that problem, I would have another.  The experiment would take years.  During that time, if PISA had been reporting superior results for the Czech Republic, English teachers would have been crossing the English Channel to bring back the Czech innovations that struck them as useful, a process the English authorities and the researchers could not prevent.

The next problem is fatal.  I would have no audience for my results.  The people responsible for national education systems are not replicators.  Each country has its own values, history, politics, professional views, and just plain baggage.  So they have no interest in replication; they are creative adaptors, fashioning solutions to educational challenges in ways that reflect the needs of their own students, their own values, their own aims and their own circumstances.  Even if RCT could prove that one nation's overall education model "caused" its superior results, it could not with any certainty show which features of that model were responsible and, even if it could, no nation would adopt that model in the form in which it was tested anyway.

So what is the alternative to RCT if the aim is to learn as much as possible about the characteristics of effective state and national systems of education?  Our answer is industrial benchmarking.  Described at greater length in a piece I wrote years ago, industrial benchmarking has its origins in the early 1980s, when Japanese manufacturing firms were beating their American counterparts on quality, price and time-to-market, putting many of them out of business.  The best of the American firms survived by working hard to understand how the Japanese were beating them.  The American firms were not interested in copying anyone.  They sent their engineers to visit a full sample of the top Japanese firms, using a wide variety of methods to collect and analyze relevant data.  They visited factories and talked to foremen, top managers, analysts, policymakers, bankers, independent industry experts, and many others.  They read reports, looked at data and made highly detailed records of their conversations and what they saw in competitors' factories.  And, with their notebooks and computers full of what they had learned, they returned to the United States to put together business plans, new product designs, reformulated production methods and new training systems that enabled many American firms to beat the competition, not by copying their competitors, but by learning from them.

This is what my organization and others have been doing for years.  It is what the top-performing countries have been doing for years, to learn from each other.  The critics denounce this approach as obviously unscientific, while at the same time using spurious examples of RCT in the comparative literature to show that RCT can in fact be used for comparative purposes.  They are spurious in the sense that the examples are not in fact examples of the use of RCT to compare entire systems of education.

So I end where I began.  A very nice paper by Dylan Wiliam, Emeritus Professor of Educational Assessment at the University of London's Institute of Education, points to a number of problems with the RCT method that need attention.  But, for me, the central issue is that this "Gold Standard" of methods fails most completely in the very arena in which better theory and better research methods are most urgently needed.  The methods I described as industrial benchmarking do not collectively have the technical elegance of RCT, and they require more judgment and less quantitative sophistication than is currently in vogue.  But they enabled American manufacturers to learn what they needed to learn to counter the Japanese onslaught and stay in business, and they have been good enough to propel the world's leaders in education to their current spots at the head of the global league tables.  RCT should have worked but didn't.  Industrial benchmarking shouldn't have worked but did.  Don't you suppose that the United States ought to add it to our research arsenal in a serious way?

