
We Need Schools That 'Train' Our Judgment


Dear Diane,

Thanks for “nailing” Nicholas Kristof. Another very well-meaning ally. With friends like Kristof, we...

Kristof ought to read Rothstein et al. more carefully on the complexity of the relationship between the economy and schooling. The lag between schooling data and economic data is one source of error. Confusing correlation with causation is another. Besides, if all the people in the world became well-educated, would that be tragic?

What data is Kristof referring to that shows that teachers with a better education and who stay in teaching longer don’t “teach” better? What does “teach better” mean? Suppose it turned out that college degrees don’t raise productivity in engineering, medicine, law, Wall Street, or politics?

There’s been a lot of talk on the business pages of our media about the “data problem.” It ought to give the “data-driven” school reformers pause. Maybe we are just creating a bubble that, too, will burst if we continue to base our actions on the belief that scores on standardized instruments are evidence of success.

Even the technical meaning of a “good test” is open to dispute. Margo (a frequent commenter on our blog) states that at least they are more “reliable” than professional judgment. How can she tell?

I want a nation of citizens who are less inclined to think that the “truth” can be captured in one of four feasible answers—a, b, c, or d. I mention “feasible” because in constructing such tests it is crucial not to have one “right” answer and three absurd alternatives. They are designed to produce differentiated responses. There’s a peculiar science/logic to this arrangement. On both IQ/ability and traditional achievement tests we’re promised ahead of time a population that fits a normal curve. We’ve replaced these in K-12 schools with judgments about benchmarks, which must still rest on a numerical rank order based on a, b, c, d. The big new invention is that there is often no technical back-up for the validity or reliability of such exams. Many big-name psychometricians shun them.

All “reliability” tells us is that the student would get a similar score on a similar test if given at another time or place. The “garbage in, garbage out” dilemma. All scores, on old tests or new, also carry a substantial measurement error. Like Wall Street's numbers, we have no independent basis for relying on the scores—validity is in the eye of the beholder (human judgment). Since they correlate with family income, wealth, parental education, and race, and are gatekeepers to prestigious institutions and jobs, it’s a circular game.
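That measurement error has a standard formula in classical test theory: the standard error of measurement, SD × √(1 − reliability). Here is a minimal Python sketch; the SD and reliability figures are illustrative assumptions, not numbers from any actual exam.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: how far an observed score
    typically falls from the 'true' score, in classical test theory."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale: score SD of 15 points, reliability of 0.91
# (a high figure by the standards of published exams).
band = 1.96 * sem(15, 0.91)
print(f"95% band: +/- {band:.1f} points")  # 95% band: +/- 8.8 points
```

Even with that generous reliability, a single score only locates a student within a band of roughly nine points either way, which is one reason a cutoff that "fails" a child by a point or two tells us very little.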

When parents told me that their child seemed to read well, but scored poorly, they often believed the indirect evidence (test score) and not the direct evidence (listening to their children read). Parents had been trained to distrust judgment and rely on “real evidence”. That’s how I became a testing maven—when my own 8-year-old son “failed” a 3rd grade reading test even though I “knew” he could read fluently.

We need schools that “train” our judgment, that help us become adults who are in the habit of bringing judgment to bear on complex phenomena. This includes judging which expertise to “trust”—and defending such choices; it includes being open-minded about one’s judgments, as well as one’s favorite experts. It includes acknowledging that even experts must live with a substantial degree of uncertainty.

"I think we are lying to children and families when we tell children that they are meeting standards and, in fact, they are woefully unprepared to be successful in high school and have almost no chance of going to a good university and being successful." —Arne Duncan, U.S. News and World Report

Duncan seems more comfortable lying with statistics. What, after all, is his definition of a “good college” but one that’s hard to get into—thus consigning most people to failure? Similarly, what’s his definition of “success”? Doing “better than average”? That, too, consigns most of us to failure. I know too many successful adults who don’t meet Duncan’s definition to call such teachers liars.

Some folks (Diane?) think that my skepticism about tests should make me a fan of a single national test. All of these dilemmas are even worse, after all, when we are dependent on data collected from states using 50 different tools, each reporting scores in different ways, and on and on. I think I’d be a sympathizer if we could go back to the old NAEP tests, developed in the 1960s, which tried to use sampling in the interest of collecting better data—standardized prompts and open-ended tasks—that opened the door to more authentic responses. Such an instrument could provide some common data out of which we might develop uncommon responses and interpretations. But NAEP went from a promising beginning to being another standardized test. In our eagerness for simpler data, when only complex data will do, we lost a useful research instrument.

We turn classroom teaching into a “test-like” setting. When we script teaching and pre-code children’s responses we have simply another form of standardized testing. I see it daily: when teachers tell children to put on “their thinking caps.” The kids shift into that special “school-mode” of so-called thinking: trying to guess what answer the teacher wants to hear.

It’s not what was needed in the 19th century, or the 21st.


P.S. Take a look at Imagining Possibilities on just this subject.


An on-topic new report from Fordham Foundation this morning, The Accountability Illusion...



I once had a third grader, Caitlin, who was probably a better (more fluent) oral reader than half the faculty in the school. At first impression one would have clearly called this kid gifted. She was seemingly a great little reader - until I started asking her questions about what she just read. I couldn't believe it. It was as if she had never read the story/passage at all. She knew nothing of what she had just read so perfectly out loud.

It certainly opened my eyes to the value of objective and quantitative data. Caitlin had been schooled at home on oral reading. Shortly after I met the parents I was able to figure out why her "reading" was such a mystery.

I know the experience you're referring to, but I'm unclear on the reference to objective data. It seems like you did not need any quantifiable data to figure out if your student was comprehending what she read. You just needed to ask her, which is what teachers do. What am I missing?

Hi All.... a taste of spring weather here in southern New Jersey!!

Maybe it was my eighth graders' assignment last night concerning Reconstruction after the Civil War....maybe it was actually looking around at our urban schools in NJ for years... maybe it was reading Rothstein... Kozol...Kohl and you, Deb!!!

But it seems our country is really not interested in really looking and "seeing" who it is we are "doing" our educational reform on!

Here is an example from Chicago:

Segregated African American elementary schools which could have faced reconstitution at the end of the...
1. Bass (1140 W. 66th St. 60621); 508 students; 99.8 percent African American.
2. Bethune (3030 W. Arthington 60612); 353 students; 99.4 percent African American.
3. Bontemps (1241 W. 58th St. 60636); 393 students; 98.7 percent African American.
4. Bradwell (7736 S. Burnham 60649); 770 students; 99.9 percent African American.
5. Brunson (932 N. Central Ave. 60651); 806 students; 96.4 percent African American.
6. Copernicus (6010 S. Throop 60636); 346 students; 99.4 percent African American.
7. Curtis (32 E. 115th St. 60628); 470 students; 97.7 percent African American.
8. Doolittle (535 E. 35th St. 60616); 431 students; 99.1 percent African American.
9. Dulles (6311 S. Calumet 60637); 429 students; 99.8 percent African American.
10. Dumas (6650 S. Ellis 60637); 382 students; 99 percent African American.
11. Earle (6121 S. Hermitage 60636); 394 students; 100 percent African American.
12. Fermi (1415 E. 70th St. 60637); 239 students; 98.7 percent African American.
13. Fuller (4214 S. St. Lawrence 60653); 283 students; 99.6 percent African American.
14. Fulton (5300 S. Hermitage 60609); 654 students; 82.3 percent African American; 17.5 percent Hispanic American.
15. Harvard (7252 S. Harvard 60620); 519 students; 98.7 percent African American.
16. Henderson (5650 S. Wolcott 60636); 461 students; 98.7 percent African American.
17. Hinton (644 W. 71st St. 60621); 421 students; 99 percent African American.
18. Holmes (955 W. Garfield Blvd. 60621); 462 students; 99.6 percent African American.
19. Howe (720 N. Lorel 60644); 540 students; 99.4 percent African American.
20. Johnson (1420 S. Albany 60623); 281 students; 99.3 percent African American.
21. Kershaw (6450 S. Lowe 60621); 273 students; 98.9 percent African American.
22. Key (517 N. Parkside 60644); 389 students; 98.5 percent African American.
23. Lathrop (1440 S. Christiana 60623); 322 students; 99.1 percent African American.
24. Lavizzo (138 W. 109th St. 60628); 506 students; 98.6 percent African American.
25. Lewis (1431 N. Leamington 60651); 813 students; 86.1 percent African American; 13.5 percent Latino (110 students who are minorities but are not black).
26. Libby (5300 S. Loomis 60609); 569 students; 92.8 percent African American; 7.2 percent Latino (41 students who are minorities but not black).
27. May (512 S. Lavergne 60644); 588 students; 98.8 percent African American.
28. McKay (6901 S. Fairfield 60629); 1,052 students; 89 percent African American; 11.1 percent Latino (116 students who are minorities but not black).
29. Medill (1301 W. 14th St. 60608); 147 students; 100 percent African American.
30. Morton (431 N. Troy 60612); 284 students; 94.7 percent African American.
31. Nash (4837 W. Erie 60644); 584 students; 99.1 percent African American.
32. O'Keefe (6940 S. Merrill 60649); 675 students; 100 percent African American.
33. Park Manor (7037 S. Rhodes 60637); 378 students; 99.7 percent African American.
34. Parkman (245 W. 51st St. 60609); 156 students; 87.8 percent African American; 11.5 percent other minorities (18 students this school year).
35. Reed (6350 S. Stewart 60621); 297 students; 100 percent African American.
36. Ross (6059 S. Wabash 60637); 411 students; 100 percent African American.
37. Schiller (640 W. Scott 60610); 190 students; 99.5 percent African American.
38. Sherman (1000 W. 52nd St. 60609); 584 students; 99.1 percent African American.
39. Smyth (1059 W. 13th St. 60608); 592 students; 98 percent African American.
40. Wentworth (6950 S. Sangamon 60621); 427 students; 98.6 percent African American.
41. West Pullman (11941 S. Parnell 60628); 424 students; 99.5 percent African American.
42. Yale (7025 S. Princeton 60621); 294 students; 99.7 percent African American.

Total 19,097 students in the 42 schools that could have faced 'turnaround'
97.7 percent African American.

Really amazing...really sad.... mostly goes unsaid.....


Deborah: I've said for years, tongue only partly in cheek, that I'd like to see a law passed that only politicians who have proved that they know and understand something about assessment in general and standardized testing in particular should be allowed to write, vote on, or sign any bill having anything to do, directly or indirectly, with assessment in public schools. And obviously they should have to prove their knowledge and understanding by means of a multiple-choice, standardized test, right? I nominate you to the test development committee :-)


Sorry for any confusion. I was not asking Caitlin how the main character felt about situation x or y in the story. I asked her who the main characters in the story were and where or when the story took place. These were to obtain objective information which helped me evaluate her reading. There was only one correct answer for each question - no fudge factor involved. Perhaps it's simply our interpretation of the term objective or quantitative which is causing the confusion.

Again, my story was an attempt to remind Deborah (not that she should need reminding) that oral reading is not the only variable teachers use to evaluate a youngster’s reading. A number of other factors come into play as well.


First, let me say I have been inspired by your goals for education, your work at real schools with real kids, and your idea of Habits of Mind. Also, I should say that I am a giant critic of tests.

However much I applaud your work and your goals, I often find your criticism of tests to be misguided.

First of all, I think that you have a tendency to confuse or conflate -- a little bit -- reliability and validity in your criticism. (Though you clearly do know the technical difference.)

Reliability, as you say, is about getting the same score again. It is about how consistently a test measures whatever it is testing. For the tests that teachers give in schools, we'd want tests that are reliable throughout the day, so that the kids who have a particular class at a different time of the day would not have an advantage or disadvantage because of that. If a teacher had multiple forms, so that later classes do not get the same questions -- to prevent cheating -- we'd want the forms to be reliable and equivalent. There are all sorts of things to worry about with the tests that are used in school every day, and none of them are addressed on a regular basis.

Here is an example of the kind of reliability concerns we perhaps ought to be addressing: You and I are both teaching The Grapes of Wrath this year, and we each give a traditional sit-down test on it, and assign some kind of project for it. How well would my kids do on your test and project, and how well would your kids do on mine? Add to that the question of inter-rater reliability (i.e. would we grade the same tests and projects the same way, even if we used the same rubrics?).

So, we don't know how reliable those tests are. But we can be reasonably certain that they vary. Given that teachers are not taught about test design and reliability, we can be reasonably certain that at least some of them handle this issue poorly, perhaps even most of them.

And here's the kicker, the idea that made me realize how important reliability actually is (even though I am a HUGE critic of the kinds of tests that have high reliability): "Reliability is an upper bound on validity."

What does that mean? Well, if a test is not reliable, then the scores vary a bit -- or even a lot -- when they shouldn't. And the more they vary, the less you can be sure what you are actually measuring. Are you measuring time of day? Disruptive noises in the hall? Responsiveness of the grader to the particular themes in the work compared to other themes in the work? Pen selection? Etc., etc.
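In classical test theory that "upper bound" has a precise form: the correlation a test can show with any criterion is capped by the square root of the product of the two measures' reliabilities. A minimal sketch, where the 0.49 figure is purely an illustrative assumption:

```python
import math

def max_validity(test_reliability, criterion_reliability=1.0):
    """Classical-test-theory ceiling on the observed validity
    coefficient: sqrt(r_test * r_criterion)."""
    return math.sqrt(test_reliability * criterion_reliability)

# A test with reliability 0.49 can correlate at most 0.70 with
# anything, even a criterion measured perfectly.
print(f"{max_validity(0.49):.2f}")  # 0.70
```

So an unreliable test cannot be rescued by a wonderful criterion: the noise caps the correlation before the question of validity even arises.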

Deborah, I think that you and I agree that these standardized and psychometricized tests fail in measuring what we want them to measure. My big criticism is that we don't know how to measure the things that we really most ought to be measuring -- like the habits of mind that you and Ted Sizer champion. And if we don't know how to measure what is worth measuring, then why bother with all the rigmarole of psychometrics? Right?

But reliability IS an upper bound on validity. Teachers DO evaluate students every day, formally or informally, and those methods of evaluation are not very reliable (I know that I have not shown that in this post. That's for another day.)

So, do we go with tests that are known to be reliable, but are known not to measure what we truly and deeply care about? Or do we sacrifice reliability for tests that have better face validity (i.e. that look on their face to be measuring the right thing)? The killer for folks like us -- folks who care about things like Habits of Mind -- is that without reliability we can't have validity.

And then we get to practicality. The best way to improve the reliability of a test is to add more items. And performance assessments or open-ended questions take much more time than multiple choice questions, leading to tests with far fewer questions. To get the kind of reliability that a multiple choice test delivers, the kids would have to spend a week answering open-ended response questions, rather than the hour or two that the multiple choice test takes.
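The item-count point is the Spearman-Brown prophecy formula: lengthening a test by a factor k raises its reliability to kr / (1 + (k - 1)r). A sketch with assumed numbers:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened (or shortened)
    by length_factor, assuming the added items are comparable."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# Hypothetical 20-item test with reliability 0.70:
print(f"tripled to 60 items: {spearman_brown(0.70, 3):.3f}")   # 0.875
print(f"cut to 5 items:      {spearman_brown(0.70, 0.25):.3f}")
```

This is why open-ended exams, with their handful of tasks, struggle to match the reliability of an hour of multiple-choice items: it takes many more prompts, and far more testing time, to buy back the lost consistency.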

But that poses its own problems.

And so we have an enormous conundrum. No one is saying that kids should not be evaluated. Rather, people like us decry multiple choice tests for their failure to test what matters. But our tests are not -- almost cannot be -- reliable, and reliability IS an upper bound on validity. The failure of THOSE tests that we hate does not in any way prove the superiority of our assessments. Our assessments have their own flaws.

I don't have an answer here. What I do know is that "I know it when I see it" is not a sufficient standard, because I cannot be everywhere. Even if I could trust that I know it when I see it, that doesn't mean that I can trust other teachers to be able to see it as well as I do. Now, I don't think that tests which are known to test things that are not of central importance are worth our time or money -- as they become the standards and the objectives. But if our assessments really don't do any better than "I know it when I see it," then our assessments are just as flawed.

While waiting to make a “public comment” to the National Math Advisory Panel at their Stanford University regional meeting, I witnessed Jim Milgram (ex-Stanford math professor and IES oversight board member) present compelling evidence that 20% of the NAEP math questions were either not math questions or had incorrect answers. You can download the minutes of the meeting to read Jim Milgram’s presentation. I have not found any evidence that this situation has been addressed, denied, or remedied by NAEP.

I learned quickly not to trust the results of the NY state ELA exam when I looked at the reality of my students’ reading levels and the scores they received. I had a student reading on a fifth grade level score a 3 (on grade level) on the seventh grade exam. I had a gifted student, reading above grade level in my class, who had received only a 2 the year before (a high two, and thus she was one of the students I was supposed to focus on). I understand that no test is perfect. But if we really want to understand students’ abilities and how much they are learning, we need multiple assessments. I learned more from sitting with a student, listening to them read, and asking them some basic questions than from the 1, 2, 3, or 4 they were labeled with. As a teacher who read YA literature, I could tell from a two-second conversation if a student was really reading and understanding a text.

Ultimately I believe the most powerful tests are those developed to assess what students have actually learned. I have no problem with acknowledging the results of some other test (whether national or state). But it shouldn’t be given carte blanche to dictate a student’s ability.

As always, these letters make for fascinating reading. This one in particular is so highly quotable it makes me want to manufacture magnets for school filing cabinets. I hope we can begin to honestly question the effective-successful measures used to distribute children, teachers, administrators, and districts on a curve and start to pay some serious attention to the statistics provided by Mike about Chicago. I am reminded of a conference session I attended many years ago, when we were asked to break into groups and compare two pieces of student data. The first was a page of answers to a multiple choice standardized test. The other was a page of essay writing. Our task was to list what we could say with a comfortable degree of certainty about what the child in question knew and could do. On the test, all we were comfortable listing was that the child could correctly bubble in a single answer to each question. There was no way to tell if it was guesswork or not. On the essay, we generated a long list of accomplishments, based on clear evidence we could all agree on. The endless conundrum of such an exercise is the piece we didn't do - a cost comparison.

Hi All.... hope this finds you well.

My take....we need to be thinking much differently or we will continue down a road that goes nowhere! Standardized measures will only accomplish what we have already had in America for way too long!

Lani Guinier has written,
"What has happened is that the testocracy has been manipulated to reproduce and credentialize the already existing social hierarchy."

Urban students are likely to have higher rates of mobility, absenteeism, and poor health. They are also less likely to have health coverage, which decreases attendance and reduces funding based on attendance-based formulas.

Children living in urban areas are much more likely to be living in poverty than children in other types of communities. In 1990, 20% of children nationwide were living in poverty. However, 30% of children living in urban areas lived in poverty, compared with only 13% of those in suburbs and 22% of those living in rural areas (Krantzler et al, 1997).

Approximately 40% of urban students attend schools with high poverty concentrations. This is a large number compared to the 10% of suburban students and 25% of rural students who attend such schools (Reyes et al., 2004).

Poverty comprises the “600 pound gorilla” that most affects American education today (Berliner, 2005). The concentration of poverty in a school is a major factor associated with student academic achievement.

In fact, according to Krantzler et al (1997), the “relationship between school poverty concentrations and school academic achievement averages is stronger than the relationship between individual family poverty and individual student achievement.”
Poverty is strongly correlated with race and ethnicity. Consequently, African-American and Hispanics are greatly overrepresented in the groups that suffer severe poverty in urban areas (Berliner, 2005).

In 1999, 17% of Americans age 5 to 24 were from families in which the primary language spoken was not English. Sixty-five percent of these students’ families speak Spanish (Slavin, 2005). Due to the large Hispanic population, urban public schools have higher proportions of students with limited-English proficiency.

According to Krantzler et al. (1997), in 1993-94 urban schools had twice the national average proportion of students with limited-English proficiency.

Statistics from the U.S. Department of Education, Office for Special Education Programs for Fall 2006 show that although African Americans represented just 15% of all students, they represented 21% of students in the special education category of specific learning disabilities, 29% in the category of emotional disturbance, and 33% in the category of mental retardation.

The dropout rate among minority children with disabilities has been 68% higher than whites. More than 50% of minority students in large cities drop out of school.

As Kozol suggests:
Still Separate, Still Unequal:
America's Educational Apartheid

Until we have the courage to look at this.... the test-and-measure discussion seems like a smoke screen.

Is America ready to talk about this?


It seems to me that all tests have their flaws, but some can be useful if we take them with some humility.

Marianne Moore's wry lines about poetry seem relevant here:

"I, too, dislike it: there are things that are important beyond all this fiddle.
Reading it, however, with a perfect contempt for it, one discovers in
it, after all, a place for the genuine."

I've been thinking about humility (and the lack thereof) in education. Whatever measure we take to improve schools, we should regard it as partial, not complete; a test, likewise, would provide partial insight.

I am not so sure that (well-written) multiple-choice tests necessarily impede thought any more than "authentic" assessments might. (Essay rubrics can be maddening!) But any given test obstructs thinking if it is given more importance than it merits: if, for instance, it is void of subject matter but weighty with consequences.

It seems we could have better, clearer tests and also take the results in stride--analyze them, glean what we can from them, improve our instruction accordingly, but stop this frenzy and finger-pointing.

Brian Rude wrote in a recent comment that "real educational improvement will be a gradual process of refinement, not revolution." I agree. But refinement requires humility. It is easier to call for sweeping change, grand reform, success for all. It is harder (but more effective) to recognize the imperfection of all that we do, to regard it closely, to work within the fraction. This brings to mind the end of a poem by Tomas Venclova:

"Better to forget. All is untruth, after all.
Experiences, approximations, beginnings.
I can no longer say what touches us:
Perhaps just air, sprouting beneath the snow,
Having taken the night to learn by heart
Our lofty science, rife with imperfections."

Diana Senechal


I wish I had said that. And that just applies to your comments. Thanks even more for the wisdom of Moore and Venclova.

Dear one and all. It's fun to read all your critiques. Thanks.

Paul H.: Mike's point is well taken, if one "joins" the student in reading--engaging together. Of course, the "asking" part also paralyzes some kids. It's also true that reading aloud to "ourselves" is a way to confuse understanding. The two acts are hard to do together--as I used to notice when I read aloud to my children at night! We focus too much on reading aloud--which we might both agree on. (Read Frank Smith's work on reading if you get a chance--it's interesting.)

Ah--segregation. Isn't it amazing what 50-plus years has not changed? I remember telling my elementary school kids about Brown vs. Board of Ed--and their puzzlement--since there wasn't a single white kid in our entire building. I'll bet the social class (wealth, status) segregation mirrors the data you presented. African American kids don't need white kids in order to learn--but a healthy racial, ethnic, and class diversity helps us all develop the common experiences to cross paths more successfully.
Jean--I once invented the "Meier law"--that any legislature that adopts required high stakes testing laws must also take such tests, and have the results made public. Ditto for all adults who administer such tests.
Ceolaf: read the two chapters in In Schools We Trust on tests--if you would--and help me see what our differences are. It's such an important subject when tests stand in for the purpose of schooling.
Yes, Alexandra, I agree that teachers looking at samples of student work together provides critical information--and substantially reduces "measurement error"--which exists with both indirect and direct data, so-called "objective" and subjective. I think more of this now occurs--which is perhaps one of the few encouraging signs I see in our schools.
Finally, Diana. Ah yes, rubrics. They are a rough attempt to keep us on the same page, and sometimes that's a plus. Probably it's all a question of balance--between various sources of information and how they are used. We need once again to take it "in stride" that we aren't all of one piece, and that the real task is pulling together disparate "evidence" and making informed, but fallible, judgments about it. If we had time to sit down and go over standardized tests with kids, respectfully listening to their logic, the tests can turn out to be quite insightful. The standardization is then an asset--it reminds us of how ordinary children "think," not just about their ignorance.

Carry on!



