This year’s statewide fourth-grade math exam administered in New York State -- the one with the remarkably high gains -- contained the following item:
“Janice bought a notebook for $3.75 and a pencil for $0.47. She gave the cashier $5.00. How much money did Janice receive in change?”
The item might have looked a little familiar to fourth-grade teachers. In 2007, a similar item appeared:
“Tony bought art supplies that cost $19.31. He gave $20.00 to the cashier. How much money did Tony receive in change?”
And in 2006, an item read:
“Mr. Marvin spent $54.10 on pants and shirts. He gave the cashier $60.00. How much money should Mr. Marvin receive in change?”
Other similarities abound. In 2008, an item read:
“During the year, one thousand eight hundred four books were checked out of the school library. What is another way to write this number?”
There was an uncanny resemblance to an item on the 2007 test:
“The number of people who live in Goodwin Falls is three thousand nine hundred eight. What is another way to write the same number?”
To be sure, the test-takers in 2008 still had to answer these questions correctly to get credit for them. But the similarity in item formats across the years gives some credence to concerns that scores are inflated.
Dan Koretz discusses the problem of score inflation in his excellent new book, Measuring Up: What Educational Testing Really Tells Us. One source of the problem, he explains, is that all tests sample the subject-matter domains that they are supposed to tap. If the same kind of item shows up repeatedly on the test from one year to the next, teachers and administrators can focus on this restricted set of test item types, and neglect other item types that are still part of the domain that the test is intended to represent.
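To make the sampling argument concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration: a domain of 10 item types, a test that keeps drawing on the same 3, and a student who answers drilled item types correctly 90% of the time and neglected ones 50% of the time. None of these figures comes from Koretz or from the New York exam.

```python
# All numbers below are hypothetical, chosen only to illustrate the argument.
DOMAIN_ITEM_TYPES = 10   # item types in the full domain (assumed)
TESTED_ITEM_TYPES = 3    # item types that recur on the test every year (assumed)
P_DRILLED = 0.9          # chance of a correct answer on a drilled item type (assumed)
P_NEGLECTED = 0.5        # chance of a correct answer on a neglected item type (assumed)


def expected_score(p_correct_by_type):
    """Expected proportion correct when each item type is sampled equally often."""
    return sum(p_correct_by_type) / len(p_correct_by_type)


# The student's mastery of each item type: drilled types first, then the rest.
mastery = [P_DRILLED] * TESTED_ITEM_TYPES + [P_NEGLECTED] * (
    DOMAIN_ITEM_TYPES - TESTED_ITEM_TYPES
)

# Score on a test that samples only the recurring item types.
narrow_score = expected_score(mastery[:TESTED_ITEM_TYPES])

# Score on a test that samples the whole domain.
broad_score = expected_score(mastery)

# The gap between the two is the inflation this toy model produces.
inflation = narrow_score - broad_score

print(f"narrow test: {narrow_score:.2f}")
print(f"broad test:  {broad_score:.2f}")
print(f"inflation:   {inflation:.2f}")
```

In this toy model the student's score on the narrow test overstates performance across the full domain, even though every answer on the narrow test was earned honestly.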
The National Assessment of Educational Progress (NAEP) is sometimes referred to as the “gold standard” for standardized tests. Claims about score inflation in another test, such as an NCLB-mandated state test, are often grounded in a discrepancy between NAEP and that test, either in the level of performance or in the trend in performance. The characterization of NAEP as the “gold standard” reflects the fact that it is designed to measure a much larger sample of student performance in a domain than is the typical state test. No individual child takes all of the items in the NAEP item pool; instead, students complete test booklets with blocks of items. In the 2000 12th-grade mathematics NAEP, for example, students completed one of 26 different test booklets, each containing three 15-minute blocks out of a total of 13 different blocks of mathematics items. Each student was asked to complete about 40 items across the domains of number sense, properties, and operations; measurement; geometry and spatial sense; data analysis, statistics and probability; and algebra and functions.
Overall, enough students respond to all of the items in the NAEP item pool to be able to measure how well the population of students in a state (or large urban district) is doing. But NAEP is not designed to yield scores for individual students, because no student responds to enough items to yield a reasonably precise measure of performance.
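The booklet design described above can be sketched in a few lines of Python. The actual NAEP booklet map is not given here, so the layout below is an assumption: a cyclic arrangement built from two hypothetical starter booklets, which happens to produce 26 distinct booklets of three blocks each from a pool of 13 blocks. It shows how each student can see only a fraction of the pool while the booklets together cover every block evenly.

```python
from collections import Counter
from itertools import combinations

NUM_BLOCKS = 13         # blocks of math items in the pool (from the text)
BLOCKS_PER_BOOKLET = 3  # 15-minute blocks per booklet (from the text)

# Hypothetical starter booklets; shifting each one through all 13 positions
# yields 26 booklets. This is NOT the real NAEP assignment, only one layout
# that matches the counts given in the text.
STARTER_BOOKLETS = [(0, 1, 4), (0, 2, 7)]


def build_booklets():
    """Return 26 booklets, each a sorted tuple of three block numbers."""
    booklets = []
    for starter in STARTER_BOOKLETS:
        for shift in range(NUM_BLOCKS):
            shifted = sorted((block + shift) % NUM_BLOCKS for block in starter)
            booklets.append(tuple(shifted))
    return booklets


booklets = build_booklets()

# How many booklets contain each block?
block_counts = Counter(block for booklet in booklets for block in booklet)

# How many booklets contain each pair of blocks together?
pair_counts = Counter(
    pair for booklet in booklets for pair in combinations(booklet, 2)
)

# Share of the item pool that any one student sees.
share_seen = BLOCKS_PER_BOOKLET / NUM_BLOCKS

print(f"booklets: {len(booklets)}")
print(f"booklets per block: {sorted(set(block_counts.values()))}")
print(f"booklets per pair of blocks: {sorted(set(pair_counts.values()))}")
print(f"share of pool seen by one student: {share_seen:.0%}")
```

In this layout every block appears in the same number of booklets and every pair of blocks shares exactly one booklet, so no block is over- or under-exposed across the population, yet any single student answers fewer than a quarter of the blocks.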
With tongue firmly in cheek, skoolboy offers the following solution to test score inflation: more testing. Imagine if students completed the entire pool of NAEP items (or some other broad pool of items assessing performance in a domain), instead of the relatively restricted sample of items used in most state-level testing programs. If students were assessed on a broad array of items tapping subject matter competence, teachers and administrators would not be able to concentrate their attention on a subset of item types, and hence would not be able to artificially raise students’ scores relative to their true learning of the subject. Sure, the burden of testing would increase; we'd need to invest in better and more expensive tests; and increased testing wouldn't solve the incentive problems that high stakes create.
More testing. An idea whose time has come?