
Time to Get Smart about Assessment


I have been impressed by President-elect Obama’s pledge to steer clear of ideologically driven policies and instead to make policy based on the best ideas, regardless of their origin or political correctness. In that spirit, let’s take a look at the hot-button issue of assessment.

From former test-scorer Todd Farley comes a confession that the test scores by which our schools are judged are less than reliable. Mr. Farley worked scoring the National Assessment of Educational Progress, which calls itself nothing less than “The Nation’s Report Card.” He writes:

There’s not enough column space in this newspaper to list the myriad discrepancies I’ve seen in the scoring of short-answer/essay questions on “standardized” tests, but in my opinion, test scoring is akin to a scientific experiment in which everything is a variable. Everything. In my experience, the score given to every open-ended response, and ultimately the final results given to each student, depended as much on the vagaries of the testing industry as they did on the quality of student answers.

Mr. Farley goes on to describe in detail the limitations of his fellow scorers, some of whom had a limited grasp of the English language, and one who was unaware that he was actually scoring student work.

Of course, you would never guess that there was any question about reliability from visiting the NAEP website. Nor would you guess it from the policy proposals emerging from the wreckage of NCLB, which suggest that NAEP be used as the basis for some kind of national standardized test, allowing comparisons of students across the country. In fact, NAEP tests are so highly regarded that the No Child Left Behind Act requires states to participate in NAEP as a condition of receiving federal Title I funds.

I do not really know if NAEP is any better or worse than other standardized tests. I just hope that this revelation helps us look beyond the supposed precision provided by those ever-so-scientific looking test scores.

I also hope this leads us to take a broader view of assessment. Classroom-based assessments are often discounted as being unreliable and subjective. The role of the teacher has been reduced as standardized tests have become the crucial judges of our success. But when we read how the NAEP is being scored, we get the feeling that the vaunted objectivity of standardized tests may be less than it has been cracked up to be.

Research has shown that high school grades assigned by teachers are the best predictors of success in college, so maybe teachers are not as unreliable and subjective as we thought. Perhaps it is time to reinvest in teacher-based assessment practices. Teachers need to learn to assess more deeply, and apply what they are learning to provide students with timely feedback. It will take professional development and time to develop these skills. But if we took all the energy and resources that now go into high stakes tests and test preparation and turned that energy towards smarter authentic assessment practices, linked directly to classroom learning, I believe we would see better results, both in terms of the quality of assessment, and in terms of better-informed instruction.

If you aren't sure what I mean by authentic assessment, take a look at the resources on this site hosted by the University of Wisconsin-Stout, and at the Authentic Assessment Toolbox created by Jon Mueller.

What do you think about the validity of high stakes tests such as the NAEP? How do you think our assessment practices should be strengthened?


I think we should go back to how things were prior to No Child Left Behind. I am a teacher, and there is a lot of stress, especially in the math department at my school, since we are under warning because of our math scores. I went through school just fine as a student without these "stress tests".

Part of the motivation for No Child Left Behind -- the good part -- was the fact that a great many students were indeed falling behind, and there were no systematic mechanisms to force educators to respond to that fact. I do not happen to believe that the punitive approach of NCLB has had the desired effect, nor do I think the assessment strategies the law has ended up promoting are working to move all children ahead. But I think we should move forward acknowledging that we need to design effective assessment strategies that identify struggling students so that we can give them the support they need. Teachers can do a much better job at this task than is currently being done by standardized tests, in my opinion.


It's pretty hard to respond to your question. First off, NAEP is not a high stakes test. To the best of my knowledge it is a no-stakes test--and has been since before NCLB. Second--what do you think about validity? You are asking for opinions about validity? Thinking back to my last statistics course, I believe that validity measures the degree to which an indicator actually measures what it purports to measure. Todd Farley casts aspersions on reliability--something different. Reliability pertains to whether a measure is consistent--whether the same results can be reproduced time after time. His descriptions of test readers sound horrendous. However, what would establish reliability is whether or not this group would produce the same or similar results given the same test item. This is measurable. The 80% reader agreement that he alluded to would be a measurement of reliability. The way to "make the statistics dance," if there were problems, would be to fire consistent outliers and to increase the number of readers grading each paper (thereby reducing the randomness--I believe).
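The 80% reader-agreement figure mentioned above is straightforward to compute. As a minimal sketch (using made-up rubric scores, not actual NAEP data), here is raw percent agreement between two readers alongside Cohen's kappa, a standard statistic that corrects percent agreement for the agreement two readers would reach by chance:

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of papers on which two readers gave the identical score."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance, using each reader's score distribution."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected chance agreement: sum over scores of the product of
    # each reader's marginal frequency for that score.
    p_exp = sum((counts_a[s] / n) * (counts_b[s] / n)
                for s in set(a) | set(b))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 0-3 rubric scores from two readers on ten essays.
reader1 = [2, 3, 1, 0, 2, 2, 3, 1, 2, 0]
reader2 = [2, 3, 1, 1, 2, 3, 3, 1, 2, 0]
print(percent_agreement(reader1, reader2))  # 0.8
print(cohens_kappa(reader1, reader2))
```

On a narrow rubric (say, 0-3), two careless readers will agree fairly often by luck alone, which is why kappa comes out noticeably lower than raw agreement; an 80% agreement rate can mask weaker genuine consistency than it suggests.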

Now--it is true, there is sometimes a trade-off, or balancing act, between validity and reliability. It is much easier to come up with very reliable multiple choice tests. However, validity suffers--you cannot get the same depth in a multiple choice or single-word fill-in-the-blank item that you can in a written solution to a problem. You cannot be as certain that a student really knows and understands the material. Yet some of the more valid measures suffer from the kind of reliability problems that Todd Farley describes. So, in most cases, the state tests (some of which are high stakes) include some of both. I see far more focus on training kids for the less valid measures--the multiple choice type--than I do on the more valid but less reliable ones. I have heard that in my state this is because the teachers have figured out that kids can pass without the more open-ended questions. Maybe this is true, or maybe teachers just don't have as good a handle on how to teach that kind of depth.

I do know that every state publishes an annual report on the psychometrics of its test. Most of us don't read them--I don't; I don't begin to understand all of it. But if it comes down to it, I'd rather have a discussion with someone who does about why the tests are or are not reliable and/or valid than rely on the word of a former test reader (who spent fourteen years in a job that he characterizes as one that draws primarily dingbats).

I suppose whether NAEP is a high stakes test depends on your perspective. It may not be high stakes for the students taking it, but given that a whole state may be judged by these scores, that seems high stakes to me.

As to trusting psychometricians over Mr. Farley, it seems to me we have here a firsthand account of what is happening in the sausage factory. We can keep eating what they are feeding us, but I think his account should be heeded. I think we should be exploring alternatives -- building up from the classroom.

"I suppose whether NAEP is a high stakes test depends on your perspective. It may not be high stakes for the students taking it, but given that a whole state may be judged by these scores, that seems high stakes to me."

By that definition, I cannot think of any test, from Monday's spelling pre-test to the AP Calculus exam, that is not "high stakes" for somebody.

Regarding Mr. Farley and psychometricians, we could resolve your sausage-factory view of testing by eliminating his job and sticking to machine-scored assessments (in fact, there is machine-scorable essay grading, which is apparently valid and reliable, and used in some states). The problem would be the loss of some validity in order to increase reliability. Are you ready to go there?

Before we start hanging any new "stakes" on classroom assessments, I think we need to know a whole lot more about THEIR validity and reliability. I am not familiar with your research indicating that teacher grades are a good predictor of college success. I do know that they are among the least reliable indicators of student knowledge of content.

Here is one study that I have seen referenced that supports the predictive value of high school grades:

From the Chronicle of Higher Education:

“The high-school grades of University of California students are better predictors of their success in college than are their SAT scores, according to a new study by Saul Geiser and Maria Veronica Santelices, of the university’s Berkeley campus.”

“The study examined the fates of nearly 80,000 students who entered the university system as freshmen from 1996 to 1999.”

“Standardized-test scores do add a “small but statistically significant improvement in predicting long-term college outcomes,” the authors concede. But they argue that SAT scores are so intertwined with students’ socioeconomic status and add so little predictive value that their use in college admissions should be minimized. “High-school grades provide a fairer, more equitable, and ultimately more meaningful basis for admissions decision-making,” they write.”

I teach high school Physics and Chemistry in Michigan, and I would say from my own experience that standardized testing forced on school districts by the state government has been more of a detriment than an aid to improving actual learning of content in our school. In recent years, a lot of energy has been wasted trying to raise student scores on MEAP tests, ACTs, and SATs. Each year's results are compared to a test from the previous year that has different scoring rubrics, different emphases, and different students, and then our teachers are asked to concentrate on helping the students do better on the tests they'll take.
I took heart in reading in a previous comment that high school grades are often better indicators of college success than standardized test scores. Let's spend the time and energy to train teachers to be better at teaching and assessment rather than at learning how to write "ACT-type questions" so the students can practice on them.

We need to move away from high stakes testing and focus on the whole child. Our children need to be exposed to history, science, and the arts to help them become complete people. The public needs to decide what is more important - critical thinking skills or bubble coloring.

We have lost focus on what schools are for. The original purpose for schools was to create citizens who would continue a democratic society. They needed to be able to read, write, and THINK.

Our students are great at filling in bubbles, but are they going to be able to make sense of the propositions that they are going to vote on when they are 18? Are they going to be able to critically think about the consequences of their choice? No, because they only know how to fill in a bubble.

We are doing ourselves a great disservice by only focusing on reading and math tests. We need people who are able to think for themselves, who are able to analyze information, who are able to question, who are able to think creatively.

It is time to "swing" back. Testing is good. Accountability is good. But creating a generation of good test takers is bad.


