
Educational Testing: A Brief Glossary


While you’re waiting for Dan Koretz’s book on testing to arrive – I think eduwonkette and I should get some kind of consideration for shilling for this book so often here – here’s a brief skoolboy’s-eye view of testing. Actual psychometricians are welcome to correct what I have to say.

Tests are typically designed to compare the performance of students (whether as individuals, or as members of a group) either to an external standard for performance or to one another. Tests that compare students to an external standard are called criterion-referenced tests; those that compare students to one another are called norm-referenced tests. Even though criterion-referenced tests are intended to hold students’ performance up to an external standard, there is often a strong temptation to compare the performance of individual students and groups of students on such tests, as if they were norm-referenced.

A typical standardized test of academic performance will have a series of items to which students respond, generally in either a multiple-choice or a constructed-response format, in which students construct their own response to the item. There’s usually only one right answer to a multiple-choice item, whereas constructed-response items may be scored so that students get partial credit if they demonstrate partial mastery of the skill or competency that the item is intended to represent. For any test-taker, we can add up the number of right answers, plus the scores on the constructed-response items, to derive the student’s raw score on the test. A test with 45 multiple-choice items would have raw scores ranging from 0 to 45.
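In code, the raw-score arithmetic is trivial. A minimal sketch, where the responses, answer key, and constructed-response point values are all invented for illustration:

```python
# Hypothetical illustration of computing a raw score.
# Multiple-choice items score 0 or 1; constructed-response items
# may earn partial credit (e.g., 2 out of 4 possible points).

def raw_score(mc_responses, mc_key, cr_points):
    """mc_responses: the student's answers; mc_key: the correct answers;
    cr_points: points earned on each constructed-response item."""
    mc_score = sum(1 for given, correct in zip(mc_responses, mc_key)
                   if given == correct)
    return mc_score + sum(cr_points)

# A student answers 3 of 4 multiple-choice items correctly and earns
# 2 points on one constructed-response item: raw score = 5.
score = raw_score(["B", "C", "A", "D"], ["B", "C", "A", "A"], [2])
print(score)  # 5
```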

For individual test items, we can look at the proportion of test-takers who answered the item correctly, referred to as the item difficulty or p-value. (This has nothing to do with the p-values used in tests of statistical significance; it is simply the proportion (p) of examinees who got the item right.) Some test items are more difficult than others, and hence items will have varying p-values.
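A p-value is just a column proportion over a matrix of scored responses. A toy sketch, with invented response data:

```python
# Sketch: item difficulty (p-value) = proportion of examinees answering
# each item correctly. The response matrix below is invented.

responses = [  # rows = examinees, columns = items (1 = correct, 0 = incorrect)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]

n_examinees = len(responses)
p_values = [sum(col) / n_examinees for col in zip(*responses)]
print(p_values)  # [1.0, 0.5, 0.25, 0.75]
```

Counterintuitively, a *higher* p-value means an *easier* item: everyone got item 1 right (p = 1.0), while only a quarter got item 3 right (p = 0.25).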

Raw scores are rarely interpretable, in part because they are a function of the difficulty of the items. For this reason, they are typically transformed into scale scores, which are designed to generate a score that will mean the same thing from one version of a test to the next, or from one year to the next. The scale for scale scores is arbitrary; the SAT is reported on a scale ranging from 200 to 800, whereas the NAEP scale ranges from 0 to 500.

The process of transforming raw scores into scale scores is computationally intensive, generally using a technique known as Item Response Theory (IRT), which simultaneously estimates the difficulty of an item, how well the item discriminates between high and low performers, and the performance of the examinee. An examinee who successfully answers highly difficult items that discriminate between high and low performers will be judged to have more ability, and hence a higher scale score, than an examinee who gets the difficult items wrong.
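Operational IRT estimation is done with specialized software, but the core idea of a two-parameter (2PL) model fits in a few lines. Everything below—the item parameters, the response pattern, the brute-force grid search—is an invented illustration, not how scoring contractors actually work:

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that an examinee with
    ability theta answers correctly an item with discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a response pattern given ability theta.
    items: list of (a, b) pairs; responses: 1 = correct, 0 = incorrect."""
    ll = 0.0
    for (a, b), x in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

# Invented item parameters: an easy item (b = -1), a medium item (b = 0),
# and a hard item (b = 1). The examinee got the two easier items right.
items = [(1.2, -1.0), (1.0, 0.0), (1.5, 1.0)]
responses = [1, 1, 0]

# Brute-force maximum-likelihood estimate of ability over a grid.
grid = [i / 100.0 for i in range(-400, 401)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, items, responses))
```

The estimated ability lands between the hardest item answered correctly and the item missed, which is the intuition in the paragraph above: getting hard, discriminating items right pushes the ability estimate up.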

There’s no one right way to transform raw scores into scale scores, and it’s always a process of estimation, which is sometimes obscured by the fact that scores are reported as definite quantities. (A little skoolboy editorializing here…) The expansion of testing hastened by NCLB has placed a lot of pressure on states, and their testing contractors, to construct scale scores for a test that represent the same level of performance from one year to the next (a process known as test equating). Much of this is done under great time pressure, and shielded from public view. The process is complicated by the fact that states typically don’t want to release the actual test items they use: once items are released, they can no longer serve as anchor items that are common across different forms of a test, since students’ performance on such items could change due to practice. Some tests are vertically equated, which means that a given score on the fourth-grade version of a test represents the same level of performance as that same score on the fifth-grade version of the test. In a vertically equated test, if the average scale score is the same for fourth-graders as it is for fifth-graders, we’d infer that the fifth-graders haven’t learned anything during fifth grade.
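To make the anchor-item idea concrete, here is a deliberately oversimplified "mean equating" sketch. Operational equating uses IRT and is far more elaborate; all the numbers below are invented:

```python
# Toy illustration of mean equating with anchor items. The anchor items
# appear on both Form A (last year) and Form B (this year); the gap in
# average anchor performance estimates how much harder Form B is.

# Invented data: each list holds examinees' scores on the anchor items.
anchor_scores_form_a = [6, 7, 5, 8, 4]   # mean 6.0
anchor_scores_form_b = [5, 6, 4, 7, 3]   # mean 5.0

shift = (sum(anchor_scores_form_a) / len(anchor_scores_form_a)
         - sum(anchor_scores_form_b) / len(anchor_scores_form_b))

def equate_to_form_a(form_b_score):
    """Adjust a Form B score onto the Form A scale."""
    return form_b_score + shift

print(equate_to_form_a(20))  # 21.0: Form B looks about 1 point harder
```

The logic only works if the anchor items behave the same way in both years—which is exactly why states are reluctant to release them: a practiced item is no longer a trustworthy anchor.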

Proficiency scores represent expert judgments about what level of scale score performance should describe a student as proficient or not proficient at the underlying skill or competency that the test is measuring. For example, NAEP defines three levels of proficiency for each subject at each of the grades tested (4th, 8th and 12th): basic, proficient, and advanced. Cut scores divide the scale scores into categories that represent these proficiency levels, with students classified as below basic, basic, proficient, or advanced. These proficiency scores do not distinguish variations in students’ performance within the category; one student could be really, really advanced and another just barely advanced, and whereas a scale score would record that difference, a proficiency score would simply classify both students as advanced. The fact that proficiency levels are determined by expert judgment, and not by the properties of the test itself, means that they are arbitrary; the level of performance designated as proficient on NAEP may not correspond to the level of performance designated as proficient on an NCLB-mandated state test. Many researchers (including Dan Koretz, eduwonkette, and me) are concerned that the focus on proficiency demanded by NCLB accountability policies has the unintended consequence of concentrating the attention of school leaders and practitioners on a narrow range of the test-score distribution, right around the cut score for the category of “proficient,” to the detriment of students who are either well below or well above that threshold. Such a focus is a political judgment, not a psychometric one, and there are arguments both for and against it.
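The mapping from scale scores to proficiency categories is just a set of thresholds. A sketch with invented cut scores (real cut scores come from standard-setting panels, not from the test itself):

```python
# Hypothetical cut scores on an invented scale; checked from highest
# category down, so a score maps to the highest level it clears.
CUT_SCORES = [(500, "advanced"), (450, "proficient"), (400, "basic")]

def proficiency_level(scale_score):
    for cut, label in CUT_SCORES:
        if scale_score >= cut:
            return label
    return "below basic"

# Two very different scale scores collapse into the same category:
print(proficiency_level(590))  # advanced
print(proficiency_level(505))  # advanced
print(proficiency_level(430))  # basic
```

Note how the 85-point gap between 590 and 505 vanishes in the proficiency report—exactly the information loss described above.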

I'll update this as more knowledgeable readers weigh in. If experts in measurement were to judge proficiency thresholds for knowledge about testing, I'd probably be classified as basic; Dan Koretz is definitely advanced. For a lively and readable treatment of these kinds of issues, get his book!


This is an amazing reference - thanks for putting it together!

And I agree we deserve a cut.

Skoolboy--thanks so much for putting this together and trying to provide a basic understanding. I know enough about equating to know that it exists, and that it is very complex, and further that it makes somewhat unlikely the theories that this year's test is simply easier or harder than last year's.

I have two questions/comments. One is about vertical equating. You suggest that if the fifth graders achieve the same mean score as the fourth graders in a vertically equated system, we would assume that they had learned nothing. Wouldn't we assume that they had rather advanced one year in their learning? (I am assuming that there is both a fourth and a fifth grade test--and that they are not just repeating the fourth grade test.) They may not have advanced in the sense of moving closer, as a group, to proficiency, but they would have advanced by one year, would they not?

The second is about your comment about the professional judgment aspect of setting cut scores. You suggest that they are arbitrary. I would consider the old stairway method (throw all the papers up in the air, and everything that lands on the top step is an A) to be arbitrary. I would agree that the methodology relies on subjective judgments--but my understanding is that even within this arena there is a science and a best-practice methodology to ensure some reliability.

BTW--I don't entirely disagree with the suggestion that setting proficiency levels leads to some initial concentration on the "low-hanging fruit" of students who hover right around the mark. But the system has a number of safeguards against this being the only focus for attention (disaggregation of data, AYP for subgroups, movement towards growth models, calculation of performance indicators that take account of student movement from level to level).

Great educational post.

I also highly recommend a report done by Greg Cizek in the late 90s.


It was an essential primer for me when I was starting as a research assistant at Brookings' Brown Center. I think it is still terrific and stands the test of time. Cizek also offers some nice context leading up to the NCLB accountability era.

- Paul

This is a truly useful resource and should be required reading for all new teachers. Bravo!

This is great.

Hi Margo/Mom,

The idea of vertical equating is to place test-takers of different ages, grades and abilities on a common scale. The key to doing so is a sufficient number of items common to the different tests to be able to link the scores across different grade levels. By design, then, a given scale score on a vertically-equated test really does represent the same level of proficiency, regardless of the age or grade of the test-taker. On a vertically-equated test, if students have increased their learning from grade four to grade five, their scores should go up.

For example, the Stanford Achievement Test is vertically equated. Virginia used to administer the Stanford 9 test annually. In 2002, the statewide average scale score for total reading in grade 4 was 634; in grade 6 was 670; and in grade 9 was 704.

Hi Skoolboy,
Let me add to the chorus--thanks for your helpful post. I had the exact same question as Margo/Mom and I still don't get it (i.e., am not proficient). Let me ask it this way: say 650 is the "passing" or Level 3 scale score on the NY state test for all grades. That doesn't mean it's the same test for all grades. The state tests different knowledge and skills in each grade, and then tries to equate their difficulty levels so last year's 4th grader at Level 3 stays a Level 3 in 5th grade this year, all else being equal. So then is that test vertically equated? Or does vertically equated apply only to tests like the Stanford with a continuous scale?


I may not be proficient on this either, but here's my impression. The test scoring process goes from raw scores to scale scores to proficiency scores. For tests that are vertically equated, the scale score for the fourth grade test represents the same level of performance in the subject being tested as that same scale score for the fifth grade test. But the test content will change from year to year, as we expect fifth-graders to know more than fourth-graders. The cut scores for proficiency pertain to a judgment about what constitutes proficiency at a particular grade level. For this reason, we would not expect proficiency scores to be vertically equated such that Level 3 in the fourth grade represents the same proficiency as Level 3 in the fifth grade.

But here's something important that I should have clarified earlier: In New York State, the grades 3-8 ELA and math tests are not vertically equated. This means that it's not possible to compare scale scores across grade levels.

This is helpful and interesting--thanks! I was particularly interested in item difficulty. Do we have stats on item difficulty for the NYC tests?

Poorly worded (or downright flawed) test questions may be more difficult than others--in any case, they set a dubious example for students. I found an egregious grammatical error on the 2007 7th-grade ELA exam.

Question 5 on the 2007 test (http://www.nysedregents.org/testing/elaei/07exams/gr7bk1.pdf) reads:

The cause of the conflict between Alex and Kendall is due to

(A) the rumors they have heard
(B) the schools they attend
(C) the conversation they have
(D) the mind games they play

If the cause of the conflict is due to something, that thing is the cause of the cause. The writer probably meant, "The conflict between Alex and Kendall is due to..." or "The conflict between Alex and Kendall was caused by...." It seems that clear language is not top priority here.

Anyway, thanks for this informative glossary, and I look forward to reading the Koretz book!


I don't think we have data on item difficulty for the 2008 administration of the NY state math and ELA yet. The CTB/McGraw-Hill technical reports on the 2007 administrations were published in December 2007. For the 2007 administration, item difficulties (p-values) for the ELA tests in grades 3-8 ranged from .36 to .95, with a mean around .74. Item difficulties for the math tests in grades 3-8 ranged from .32 to .98, with means ranging from .79 for grade 3 to .61 for grade 8.

Skoolboy, you provided a nice, clear summary of these concepts.

Maisie and Margo/Mom raised questions about the meaning of performance standards and their comparability across grades. I’ll take a stab at a few of these issues. It’s hard to give a good explanation of this without going into some detail about how standards are actually set, so at the risk of sounding like the Car Guy’s Shameless Commerce Division, I urge you to read Chapter 8 of Measuring Up, which does explain this.

With respect to comparability across grades: in most cases, the panels that are given the task of setting performance standards examine only one grade and subject, without comparison to adjacent grades. Therefore, the fourth-grade standard-setters may end up setting more (or less) ambitious standards than the fifth-grade standard-setters. These standards-based scales therefore are not vertically linked, and you cannot safely conclude that a change in percent proficient from one grade to the next really indicates improvement or deterioration in performance. There have been some recent efforts to link performance standards across grades, but for the most part, if you want to use scores to evaluate how much kids have learned from one year to the next, you need to use vertically linked scale scores rather than performance standards.

However, this is not just a matter of choosing a scale. To create a sensible vertical scale, you need to construct tests in certain ways—most often, by including overlapping content in adjacent grades. Some traditional tests (e.g., the ITBS, Terra Nova, and SAT-10) are designed this way. Some state tests are not.

The issue of arbitrariness is more complicated, and this in particular is hard to abstract to a few sentences. Many people believe that standard-setting procedures somehow uncover an underlying, “real” standard. This is not at all the case. The common procedures impose a process—in most cases, a very complex and somewhat indirect one—for imposing the judgments of a panel of judges. There are numerous options for doing this, and while there is no pressing substantive reason to prefer one of them over the others, they unfortunately often yield very different results. What are you to make of it if your state, using Method A, announces that 56% of students are Proficient, when Method B would have yielded 40%? This is what I mean by arbitrary: the arbitrary choice of standard-setting methods—as well as numerous arcane details of the process—can produce very different answers. I wrote in Measuring Up, only half in jest, that “The old joke holds that there are two things no one should see being made, laws and sausages. I would add performance standards.”

An aside: this is not reliability, which Margo/Mom raised. In common language, we use “reliability” to mean all sorts of things, and parents often believe that a test score that is “reliable” is therefore “accurate,” “valid,” and “unbiased.” Not so. In testing, “reliability” refers to something very specific and much narrower: the consistency of results across different forms of a single test or occasions of testing. (The SAT is a highly reliable test, which means that most of the time—but not all of the time—a student who takes it on two occasions will obtain reasonably similar scores.) A state’s test scores can be highly reliable, even if the performance standards are arbitrary. (This is explained in Chapter 7 of Measuring Up, “Error and Reliability: How Much We Don’t Know What We’re Talking About.”)

Thanks for posting, Dan. (Is 1% of your net royalties too much to ask for my shilling for your book? Or at least a cyberbeer?) I knew when I said that proficiency standards are arbitrary that there'd be a risk of misunderstanding. Readers, Dan addresses this more thoroughly in his book, Measuring Up.

A cyberbeer, certainly, or if you pass through Cambridge, a real one. I appreciate your getting the word out. I hope your readers find the book useful--and in some places, even fun.

Strictly speaking, a test is not norm-referenced or criterion-referenced, as is commonly believed. Rather, there are norm-referenced and criterion-referenced interpretations of a test. The terms refer to the reference point used to interpret a score: either a set of standards (criteria) based on the content, or the scores of a comparison group (norm) that is meant to represent a group that is meaningful to stakeholders, such as "the nation's 3rd graders" or "5-year-olds in Kentucky."

A test can be designed with one or the other interpretation in mind. Hence non-psychometricians may be using shorthand when they call a test norm- or criterion-referenced. In fact, you can use either norm groups or content-criteria or both for any test. The appropriateness of either in each case is somewhat subjective.

I am a research scholar and I have some problems with data analysis, i.e.:
How do I compare the item difficulty indices of different formats of test items, when I have used the same question in all three formats?
