Response to 'Validity, Test Use, and Consequences: Pre-empting a Persistent Problem'
I want to begin this month-long conversation between academics and school leaders by acknowledging Madhabi Chatterji's insight that such a conversation around accountability and assessment is urgently required. And I want to thank Education Week for providing the platform to make it possible.
Assessment--whether of individuals, programs, schools, or our national progress--is clearly an essential aspect of school policy. Nobody questions the need for assessment and accountability. But how the field goes about this work and who is privileged in this discourse are matters that deserve the sort of discussion this blog is intended to promote. Today test-makers, public officials, and business and foundation leaders, not educators, dominate that discourse. A better balance is badly needed; front-line educators must be brought into the discussion.
An iceberg is a near-perfect metaphor for the field of assessment. On the surface we see numbers and results, impressive in their precision. They seem to define reality. Surely they're accurate. We can accept them at face value. But below the waterline we find nine-tenths of the structure, invisible to the passengers on deck. Here are hidden the assumptions that support the numbers: the mind-numbing complexity of Rasch models and Item Response Theory; the use of "plausible values," "bootstrapping," "jackknifing," and other statistical tricks of the trade; decisions about how to handle cut-points, benchmarks, and the possibility that test items function in different ways in different circumstances; and the hundreds of judgments made every day about measurement and sampling error (because there's always some error) and about how to link this year's modification of an assessment (or different versions of the same assessment) to prior iterations.
As my friend Madhabi Chatterji has described it, technical validity is a critical component of any assessment. It's one of the central concepts in measurement: Are you actually measuring what you said you wanted to measure? She has pointed out: "Even well-designed, well-researched, and well-intentioned assessments may be misused or used in contexts that stray far from the setting originally intended or assumed by test developers."
But we must also ensure that the accountability and assessment enterprise is valid in a broader, non-technical sense. The amount of time devoted to test taking and preparation is already well in excess of what most educators (and members of the public) would consider reasonable. As teachers teach to the test, we could find the curriculum narrowed, while demands for teacher evaluation ask assessment experts to do something that might well damage students who need the greatest help. Congress and the National Center for Education Statistics insist that the benchmarks for the National Assessment of Educational Progress should be "used on a trial basis" only and "interpreted with caution." Yet both Common Core assessment consortia have thrown caution to the wind in adopting NAEP's controversial "proficient" benchmark as a national standard, a decision that guarantees a majority of students--in public, private, and charter schools--will fail to clear the bar.
Educators pointing to these issues can be dismissed as acting out of self-interest. That's why they need to be part of the discussion, brought in to advise on the takeoff, not just help with the landing.
On international assessments, a number of concerns about reporting and interpretation require attention: the exaggeration of the significance of small differences in mean scores; the fact that bodies sponsoring different assessments can produce strikingly different results for the same countries in the same subject; and the lack of clarity about the significance and implications of an international mean that is an average of the national averages instead of a weighted mean.
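To make the last point concrete, here is a minimal sketch of how the two kinds of international mean can diverge. The two countries, their mean scores, and their student populations are invented for illustration and do not come from any real assessment; the function names are likewise my own.

```python
def unweighted_mean(national_means):
    """Average of the national averages: every country counts equally."""
    return sum(national_means) / len(national_means)


def weighted_mean(national_means, populations):
    """Average weighted by population: every student counts equally."""
    total_students = sum(populations)
    weighted_sum = sum(m * p for m, p in zip(national_means, populations))
    return weighted_sum / total_students


# Hypothetical figures: a small country with a high mean score
# and a large country with a lower one.
means = [500.0, 400.0]
populations = [1000, 9000]

print(unweighted_mean(means))             # 450.0
print(weighted_mean(means, populations))  # 410.0
```

In this made-up case the large country sits 50 points below the "international mean" under the first method but only 10 points below it under the second. Nothing about its students has changed, only the arithmetic behind the yardstick.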
But beyond that, we should be cautious about making educational comparisons among countries largely on the basis of mean test scores. How can a large, diverse, democratic nation such as the United States be meaningfully compared with tiny, homogeneous principalities such as Liechtenstein or dictatorships such as Kazakhstan? This question is all the more troubling in light of the acknowledgment by an OECD official earlier this month before a committee of the British House of Commons that the much-ballyhooed Shanghai PISA assessment ignored more than a quarter of the 15-year-olds in the city, mostly low-income rural migrant youth.
Laypeople like me often scratch our heads trying to make sense of assessment reports. We take it on faith that the specialists understand what they read. It is troubling to learn, therefore, that "[A] 'black box' [of] increasingly sophisticated measurement models [may make] assessment reports and results incomprehensible to many measurement specialists themselves."
This all boils down to a modification of the tongue-in-cheek question first posed 1,900 years ago by the Roman satirist Juvenal: "Quis custodiet ipsos custodes?" he asked. "Who watches the watchmen?" Assessment experts are the watchmen of the nation's educational standards. It's time to examine who is responsible for what's going on below the waterline. How, in brief, are we to assess the assessments? That's a question this blog is designed to explore.
National Superintendents Roundtable