
# An Immodest Proposal

This year’s statewide fourth-grade math exam administered in New York State -- the one with the remarkably high gains -- contained the following item:

*“Janice bought a notebook for $3.75 and a pencil for $0.47. She gave the cashier $5.00. How much money did Janice receive in change?”*

The item might have looked a little familiar to fourth-grade teachers. In 2007, a similar item appeared:

*“Tony bought art supplies that cost $19.31. He gave $20.00 to the cashier. How much money did Tony receive in change?”*

And in 2006, an item read:

*“Mr. Marvin spent $54.10 on pants and shirts. He gave the cashier $60.00. How much money should Mr. Marvin receive in change?”*

Other similarities abound. In 2008, an item read:

*During the year, one thousand eight hundred four books were checked out of the school library. What is another way to write this number?*

*A. 184
B. 1,084
C. 1,804
D. 1,840*

There was an uncanny resemblance to an item on the 2007 test:

*The number of people who live in Goodwin Falls is three thousand nine hundred eight. What is another way to write the same number?*

*A. 398
B. 3,098
C. 3,908
D. 3,980*

To be sure, the test-takers in 2008 still had to answer these questions correctly to get credit for them. **But the similarity in item formats across the years gives some credence to concerns that scores are inflated.**

Dan Koretz discusses the problem of score inflation in his excellent new book, *Measuring Up: What Educational Testing Really Tells Us*. One source of the problem, he explains, is that all tests sample the subject-matter domains that they are supposed to tap. If the same kind of item shows up repeatedly on the test from one year to the next, teachers and administrators can focus on this restricted set of test item types, and neglect other item types that are still part of the domain that the test is intended to represent.

The National Assessment of Educational Progress (NAEP) is sometimes referred to as the “gold standard” for standardized tests, and claims of score inflation on another test, such as an NCLB-mandated state test, are often grounded in a discrepancy between NAEP and that test in either the level of performance or the trend in performance. The characterization of NAEP as the “gold standard” reflects the fact that it is designed to measure a much larger sample of student performance in a domain than is the typical state test. No individual child takes all of the items in the NAEP item pool; instead, students complete test booklets with blocks of items. In the 2000 12th-grade mathematics NAEP, for example, students completed one of 26 different test booklets, each containing three 15-minute blocks out of a total of 13 different blocks of mathematics items. Each student was asked to complete about 40 items across the domains of number sense, properties, and operations; measurement; geometry and spatial sense; data analysis, statistics and probability; and algebra and functions.

Overall, enough students respond to all of the items in the NAEP item pool to be able to measure how well the population of students in a state (or large urban district) is doing. But NAEP is not designed to yield scores for individual students, because no student responds to enough items to yield a reasonably precise measure of performance.
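To make the booklet idea concrete, here is a minimal sketch of matrix sampling in Python. The counts (13 blocks, 26 booklets, three blocks per booklet) come from the description above; the particular way blocks are combined into booklets, the function names, and the 2,600-student population are assumptions made for illustration, not NAEP's actual design.

```python
from collections import Counter

NUM_BLOCKS = 13  # from the post: 13 blocks of mathematics items


def build_booklets():
    """Build 26 distinct three-block booklets.

    Hypothetical cyclic design (not NAEP's real booklet map): two offset
    patterns are rotated through all 13 starting blocks, so every block
    ends up in the same number of booklets.
    """
    booklets = []
    for offsets in ((0, 1, 4), (0, 2, 7)):
        for start in range(NUM_BLOCKS):
            blocks = tuple(sorted((start + o) % NUM_BLOCKS for o in offsets))
            booklets.append(blocks)
    return booklets


def responses_per_block(num_students, booklets):
    """Hand booklets out in rotation and count how many students see each block."""
    counts = Counter()
    for student in range(num_students):
        for block in booklets[student % len(booklets)]:
            counts[block] += 1
    return counts


booklets = build_booklets()
counts = responses_per_block(2600, booklets)  # 2,600 students is an assumed figure

print("distinct booklets:", len(set(booklets)))
print("blocks seen by any one student:", len(booklets[0]), "of", NUM_BLOCKS)
print("students responding to each block:", sorted(set(counts.values())))
```

Under these assumptions each student sees only 3 of the 13 blocks, yet every block collects hundreds of responses, which is why the design can describe a population well while saying little about any one child.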

With tongue firmly in cheek, skoolboy offers the following solution to test score inflation: **more testing**. Imagine if students completed the entire pool of NAEP items (or some other broad pool of items assessing performance in a domain), instead of the relatively restricted sample of items used in most state-level testing programs. If students were assessed on a broad array of items tapping subject matter competence, teachers and administrators would not be able to concentrate their attentions on a subset of item types, and hence would not be able to artificially raise students’ scores relative to their true learning of the subject. Sure, the burden of testing would increase; we'd need to invest in better and more expensive tests; and increased testing wouldn't solve the incentive problems that high stakes create.

More testing. An idea whose time has come?

Nah.

Predictable questions on a math exam? And that leads to inflation?

What if they all were predictable, every year? But the kids still had to think ("change means subtract" or "translate a number from words to symbols, but be careful to put in a 0 for any missing place values")

Wouldn't we have a much more reliable instrument? And wouldn't we have a much better idea of what we were measuring?

I wrote about something similar the other day.

How about another immodest proposal: why not make state tests more like NAEP? NCLB requires school-level results, not results for individual children. So why not use matrix sampling to test a broader range of content? California did this briefly with its ill-fated CLAS, and Maryland did it for several years with MSPAP.

I can see at least two problems. Sampling makes growth models complicated, if not impossible. And there is a strong desire for individual student results, which is one big reason California and Maryland scrapped their tests. But why does one test have to do it all? Couldn't districts and schools provide individual-level results? (I know, they're not comparable...)

Vacation break for this one comment:

I'm with Bob - If we're going to continue to fixate on proficiency, I'd love to see state tests use matrix sampling for school level accountability. Perhaps districts could give a norm-referenced test like the Stanford to provide comprehensive, comparable results for every student. One test really can't do it all, and I see a lot of benefits of decoupling school/individual level assessment issues.

If a state's content standards include the skill of "equating" 2,345 and two thousand three hundred forty-five, then the state test ought to address the skill. If everyone knows that a skill will be tested, it will be taught, and student performance and test results on that skill will increase over time or "inflate" as you would say. Three cheers for this type of inflation.

I commented nostalgically on Jonathan's blog on the predictability of the old (early 70's vintage) NYS math regents -- I think an exam that predictably covers every important topic in the curriculum can be a good thing. There will, however, be a learning curve as teachers and students grasp what the examiners consider the important areas and start to focus on those. As that happens, student scores will rise, and teaching will become more focused on the topics covered by the test.

If the test really is a good reflection of what we're most concerned that students know, that dynamic is not necessarily a bad thing. As a colleague of mine likes to say, teaching to the test is fine, as long as it's a test worth teaching to. But a predictable test WILL narrow the curriculum, so it's important that the test really covers what we want the curriculum to be.

We should also be clear on the learning curve part of the dynamic -- if test scores are rising because the test is becoming predictable, we shouldn't attribute the gains to the latest innovation in this or that school, but rather to the tendency of teachers and schools to adapt to what is tested.

I think part of the issue that arises in discussions of different standardized tests or the NAEP versus other tests has to do with what the tests are designed to measure.

There are two primary things we're trying to do with standardized tests:

1) School, teacher, and district performance (i.e. how effective are school policies and institutions).

2) Individual student evaluation. (i.e. is the student at grade level, should this student advance, what does this student need to work on, should this student graduate, etc..)

These are two very different goals. However (I think primarily because of ease and cost concerns) in the US we use the same tests for both of these goals. The problem with this method is that the types of tests/measures you would design for these two goals are very different (a lot of this has to do with statistical issues related to the unit of interest).

For goal 1) NAEP style tests that do not test all students and that do not give all students all the questions are fine and even preferable. You simply want to test how well that school/district/teacher is teaching students that concept on average. How well any particular student does is irrelevant. Consequently, not all students need to get all parts of the test and not all students need even take the test (as long as the selection process is done randomly). Additionally, you can be more confident in your results because you can have a larger sample size (hundreds or thousands of students).

However, for evaluating individual student performance (goal 2) a very different set of tests has to be administered. These tests must cover all the material--or close to it--because we need to know how well the student understood the material as a whole. These tests must also be long or be administered several times because of sample size (i.e. if you use one short test to evaluate a student you're essentially giving that student a sample size of 1--which any statistician will tell you is ludicrous for evaluating anything).
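The sample-size point can be illustrated with a rough calculation. The sketch below treats each item as an independent right-or-wrong trial with the same chance of success, which is a deliberate simplification (operational tests use more elaborate scaling models); the function name and the 70%-correct student are assumptions chosen for illustration.

```python
import math


def standard_error(p_correct, n_items):
    """Binomial standard error of an observed proportion correct.

    Simplifying assumption: items are independent and equally difficult.
    """
    return math.sqrt(p_correct * (1 - p_correct) / n_items)


# A hypothetical student who truly gets 70% of such items right.
for n_items in (10, 40, 160):
    se = standard_error(0.7, n_items)
    print(f"{n_items:>3} items: score uncertain by about +/- {se:.3f}")
```

Under this simplified model, quadrupling the number of items only halves the uncertainty, which is one way to see why a precise individual score needs a long test while a school average, pooled over many students, does not.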

I think if we divorce these two types of evaluations--by making one set of tests to measure school/district/teacher performance and another set to measure and evaluate individual student performance--we would much more effectively understand how our schools work and how our students are performing (and we would be fairer to students).

This would be expensive, but I think the educational and policy benefits would outweigh the costs. By trying to measure two things at the same time all we do is measure both poorly.

NAEP has value because it allows us to gauge the many state assessments by some kind of common measure, as well as providing some state level data. Except for districts that elect to be oversampled (i.e. the urban sample) it doesn't provide good data for a smaller grain size. This is a problem for us in the US because so much of education is locally determined, with the exception of Hawaii. We really do need to have data that is appropriate to the decision-making level. The test every kid approach gets us there, kind of. Greater levels of standardization also help. Just please don't go back to the supersecret days of everyone selecting their own standardized tests and keeping the results under wraps. The public has a stake in the quality issue--however measured.

Jonathan and Bert D: the problem is not with the test items, but rather the inferences we draw from them. An item is just a representation of the underlying skill or competency that it’s supposed to measure, not the skill itself; and there’s usually more than one possible representation. If a test repeatedly only has one item type to represent a skill such as two-digit subtraction with regrouping, and students learn how to respond to that item type, but not to another item type that *also* represents the underlying skill, there is a risk of inferring that students have mastered that skill, when in fact they might not perform as well on other representations of the skill.

Bob: Dan Koretz argues that the public’s desire for individual scores is too strong for this to fly, and I would worry about the precision of proficiency scores for the required subgroups, which can be pretty small in many schools. But for the most part, your immodesty lines up with mine.

eduwonkette: Back to the beach, rehab, your secret meeting with Barack, or wherever you are. You’re undermining the validity of the construct “blogging break.”

Skoolboy,

two-digit subtraction without regrouping IS a skill. What underlying skill are you talking about?

And that's the point. It's not bad if the kids know what's coming - they still have to respond appropriately. And that appropriate response is good math. Somewhere along the way, someone decided that it shouldn't count unless we surprise them... that's just wrong.

Jonathan,

I apologize for my lack of clarity. Two-digit subtraction with regrouping is a mathematical skill, to be sure. But I think there's more than one kind of test item that could assess a student's mastery of that skill. A test that repeatedly only uses one of the several kinds of test items that tap the skill of two-digit subtraction with regrouping is vulnerable to score inflation, because there is no guarantee that students could correctly demonstrate their mastery of that skill with a different item format.

On your blog, you raise what I view as an ancillary concern -- test items that measure not only the desired skill or competency but also some *other* skill or competency. A good test item will measure precisely the skill that it is supposed to measure, and nothing else. An item that surprises a student by introducing some other competency beyond the desired skill is not a good item, because it confounds the desired skill with something else.

Margo/Mom,

There's nothing magical about NAEP -- its status as "gold standard" refers to the broad coverage of many different skills and competencies, as determined by the National Assessment Governing Board, but in the absence of a national curriculum, no state is obliged to align its curricular standards with the standards implied by NAEP. This means that comparisons among states must be made with considerable caution, since they might reflect differences in curricular emphases across states, and not differences in performance for the same curricular standards.

"Two-digit subtraction with regrouping is a mathematical skill, to be sure. But I think there's more than one kind of test item that could assess a student's mastery of that skill."

Not so fast. I see only one type of question here. If I want to see if a kid can subtract a 2 digit number with regrouping, I ask him to do it. Why would we do something else?

And what's wrong with lots of little kids doing exactly that?

Let them see the test, we change the numbers, and let them perform. If we stop there, that's not enough. But if we start there, and the kid can do what we ask, aren't we doing well?

I don't understand why the tongue is planted firmly in cheek.

What's wrong with having a clear standard like:

Students will be able to subtract a two or three digit number from a three digit number with regrouping from hundreds to tens.

And clear performance indicators like:

423 - 171 = ?

or

418 - 83 = ?

This would give teachers guidance as to exactly what to teach and how it would be tested, as opposed to the vague standards that currently exist in most states.

There are about 17 different skills for subtraction (at the elementary math level). Each skill should have a clear standard that indicates exactly what skills are to be taught and how they will be tested.

The testing instrument would be a criterion test comprising a semi-random selection of a subset of these skills.

Teachers would be able to teach to the test, which is exactly what we want, but they'd have to teach all the skills since there'd be no way of knowing which subset of skills would be tested.
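As a rough sketch of what such a criterion test might look like, the Python below draws a random subset from 17 numbered skills and generates fresh items for the one skill spelled out above (regrouping from hundreds to tens, as in 423 - 171 and 418 - 83). The skill numbering, the function names, and the choice of five skills per test are hypothetical; the comment does not specify them.

```python
import random

NUM_SKILLS = 17  # the comment's estimate of elementary subtraction skills


def sample_skills(rng, how_many):
    """Pick which skills this year's test covers (skills are just numbered 1..17 here)."""
    return sorted(rng.sample(range(1, NUM_SKILLS + 1), how_many))


def hundreds_to_tens_item(rng):
    """Generate 'minuend - subtrahend' needing regrouping from hundreds to tens only.

    The tens digit of the minuend must be smaller than that of the subtrahend
    (forcing a borrow from the hundreds), while the ones digits need no borrow.
    """
    while True:
        minuend = rng.randint(100, 999)
        subtrahend = rng.randint(10, minuend - 1)
        borrows_from_hundreds = (minuend // 10) % 10 < (subtrahend // 10) % 10
        borrows_from_tens = minuend % 10 < subtrahend % 10
        if borrows_from_hundreds and not borrows_from_tens:
            return minuend, subtrahend


rng = random.Random(2008)  # fixed seed so the same "test form" can be reproduced
print("skills on this form:", sample_skills(rng, 5))
for _ in range(3):
    minuend, subtrahend = hundreds_to_tens_item(rng)
    print(f"{minuend} - {subtrahend} = ?")
```

The item format stays fully predictable while the numbers and the tested subset change from form to form, which is the combination the proposal relies on.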

Isn't this exactly what we want out of education?

Jonathan: Let me try again. I can imagine two different item formats for the skill of two-digit subtraction with regrouping: vertical, and horizontal. Would you agree that both formats can represent the same underlying mathematical skill? But if a test always uses the vertical format, and a teacher only teaches students how to solve an item using that format, then there is a risk that the test won't accurately reveal students' knowledge of the concept. A student's inability to respond correctly to an item using the horizontal format would be evidence of score inflation.

More generally, I'm suggesting that whenever the items on a test consistently do not represent the full range of item formats that we think a student who has mastered an underlying skill or competency should be able to answer correctly, there will be a risk of score inflation.

Lining up numbers by place value is clearly a separate skill.

I think it's a mistake to focus on the test and not on the math.

This subject is different.

Now, if you limit your point to inflation on the current test - maybe. But that's awfully hard to tell. I'm certainly not convinced that they are consistent year to year.

Somebody, please review the above debate and then explain to me why we should have confidence that 4th grade testing results can be generalized and that we can devise testing that will help kids make it in middle school, high school, and beyond.

Presumably, Jonathan or Kderosa has a better argument, which means that one might have a better approach to test design than the other. If we do a better job of breaking down 4th grade skills and/or practices, in say 17 ways, then presumably we can do a better job in teaching in 5th grade.

The complexity of these tasks goes up geometrically as kids grow up. When will we get to the point where there is some agreement that can help us in high school?

From my layperson's perspective (on this debate) the posts illustrate skoolboy's point. I read skoolboy as saying that we should look at the actual wording of the test. Then we should take a step back and reflect on what this data tells us. The debate, however, pulled us back into the minutiae, as if numbers as opposed to children are the bottom line.

Testing is a tool to help people. I want test designers to love their job and be committed to designing better tests. But education is still a people business, and we need to focus on whole human beings and the complex ecology of schools, and not 1/17th of a person's ability to subtract.

John,

The nature of math dictates 90% or so of what has to be taught to a student. It doesn't matter if that student is a gorilla, a tree, an alien, or a human, the content of math and the nature of that content doesn't change. Of course, how you teach that content may change depending on certain characteristics of the student.

So, for a student to know how to subtract it is necessary for that student to learn all the subtraction and underlying skills. The student should also be able to demonstrate the skills acquired on, say, a test of subtraction problems embodying those skills. Knowing 42% of those skills doesn't matter much; the student must know them all, each and every one of them. Is there another way to know subtraction without knowing these skills?

So if we keep the kind of test we are using now, repeated items will reward those who train by doing old test items. This could inflate scores. I get it.

But why not have more predictable tests in mathematics? Why look at underlying skills when we can test most math skills directly?

We need to look beyond testing theory, and back at the material being tested.

Mathematics is different from other subjects. In many ways it has more in common with skills we test (certification for swimming, a driver's test).

I test with questions that look a lot like homework questions and classwork questions. I think that's a good thing. My teaching is much more varied than that. And my tests are not 100% predictable. But majority, yes. I write them that way.

Jonathan,

I agree with you that the structure of knowledge in a discipline has a lot to do with what kinds of test items are appropriate. Mathematics differs from many other school subjects in this regard.

I think also that there sometimes is a lack of clarity about what the desired skill or competency is, as witnessed by some of the commenting regarding the *application* of a skill, which might involve more novelty. The issue of how knowledge transfers from one context to another, and how we can measure knowledge transfer, is something I wish I knew more about.

Of course there are going to be similarities in math test questions. There are only so many ways to ask a word problem such as "During the year, one thousand eight hundred four books were checked out of the school library. What is another way to write this number?"

The real question comes down to what the test is measuring. Is it measuring whether the student can translate words into numbers no matter what question surrounds the words, or is it trying to determine whether a student can think, adapt to new situations, and apply the knowledge learned to accomplish the task? Since most of us would agree that the tests required by the government ask students to do the first, I see nothing wrong with a rewording of the problems. If they want students to think at a higher level, well, that is another story.

So all this debate applies to elementary math and elementary math only?

Perhaps we will work through similar issues for elementary reading.

I'm cool with whatever you guys work out just as long as you don't try to extrapolate it to secondary schools.

I teach high school mathematics. The examples we started with were from a 4th grade exam, but secondary, too? Yes.

I taught grades 4-6 through the 70's-mid 80's and did a lot of test prep in math in the 4-6 weeks preceding the test.

The computation part of the test was a given.

But the problem solving required a different approach.

Since we were never up to the curriculum, we had to take a bunch of short cuts to give the kids at least a superficial knowledge of certain questions.

We used the language of old tests by looking for key words and automating a way of addressing the problem. Thus, we were actually teaching some reading more than math.

I would say that 10 minutes after the test a bunch of this stuff disappeared from their heads. But with the test over, we could go back to teaching math in an ordinary way. At least that way the knowledge didn't disappear until during the summer. Then next year's teacher could start all over again and blame the idiot teacher from the year before for doing such a lousy job.

Norm,

When I read your comment I felt something similar to what I felt a few minutes before when I read Diane Senechal and her quotes of Yeats. You both expressed truth without simplistic dogma. You both expressed the complexity of life and learning. Your post could have been a rough draft of a short story. If we are going to address the challenges of education we need great poetry and short stories and all forms of literary expression.

Jonathan, since you teach high school, I'd like to hear more about your logic and when 4th grade can be extrapolated and when it can't. So here's my sincere question. How many math teachers do we have who need deep instruction in how to break down instruction and assessment the way you have been doing it in this debate? Shouldn't our priority, in this area, be the recruiting, the retention, and the professional development of teachers? Math is the subject that is most amenable to standardized testing, isn't it? Math is the subject where we have the most potential for applying this discussion in a practical way, isn't it? Wouldn't that alone be enough of a challenge?

Do you agree that there is nothing in the above discussion that we can systemically apply to high school liberal arts? Certainly, if an educated layperson listens intently to your debate, that experience will not make us more confident that NCLB-type accountability can be fixed.

Snarky Sal takes a jibe

Yes, test scores are inflated. Yes, my fellow teachers and I are aware that the test items are similar year after year. Yes, teaching to the test is not only possible but highly encouraged. So now it is not so much a matter of whether or not students are learning… but whether or not their teachers are willing to conform.