Grading Automated Essay Scoring Programs- Part I (@bjfr) (Opinion)

Justin Reich

Assistant Professor of Digital Media and Director of the Teaching Systems Lab, MIT

Part I: What are Automated Essay Scoring software programs?

Automated Essay Scoring software programs can grade essays as well as humans.

That was one of the key findings from a new Hewlett Foundation-funded study of Automated Essay Scoring (AES) tools produced by eight commercial vendors and one open source entry from Carnegie Mellon University.

These tools provoke strong reactions of both optimism and scrutiny (and as I suggested in a previous post, even revulsion). Right after the study came out, I had a back and forth with Will Richardson, a pioneering educator in the use of Web 2.0 in classrooms, and Vantage Learning, a producer of one of the AES programs, about the potential and risks of these new tools. I plan on writing a short series of posts about the potential impact of these tools on education, but before we delve into the implications of AES for policy and classroom teaching, we should make sure that everyone understands how they work and what they can do.

I know a little about the kinds of Natural Language Processing algorithms that undergird the AES programs, which helps in making sense of these tools. Fortunately, I’m also at the American Educational Research Association conference this week and there are lots of very smart folks here, including Mark Shermis, the author of the AES study, who sat down with me for an hour on Friday night. Mark offered this preface to our conversation: “I am not affiliated with any of the commercial vendors nor do I see myself as an apologist for the community.”

So with that, let’s dive into understanding how AES works.

Automated Essay Scoring: Faithfully Replicating Human Scores

In describing the state of the art in AES to me, Mark Shermis used two words over and over: “faithfully replicate.” Those are the two key words for understanding AES.

AES programs can’t read in the way that humans do. They can’t parse the meaning of words; they can’t understand whether an analogy is silly or apt. AES programs don’t read; they compare. They examine a sample of essays scored by humans (the “training set”) and then they compare the unscored essays to the scored subsample.

This point is incredibly important to understanding what AES programs do. They do not attempt to independently assess the merits of an essay relative to a particular set of standards; rather, they attempt to faithfully replicate the assessments of trained expert humans.

How does this work in practice? Someone designs a writing prompt. The prompt can be trivial and mundane or sublimely provocative. It can be broadly philosophical or depend upon specific content and sources. The AES programs place no constraint on the quality of the essay question being asked. Then someone designs a rubric. The rubric can be trivial and mechanical or evaluate sophisticated elements of written communication. Again, the AES programs don’t constrain how the essays are evaluated. Then, kids write essays, and a group of trained, expert raters grade a subsample of the essays based on the rubric. And then, the machines step in.

The machines examine the sample of graded essays and use them to “train” the AES program to identify characteristics of essays that have been graded at each of the different levels of the rubric. They can then look at new essays and say “this text has the characteristics of other essays that humans determined should have a score of 4 out of 6 on a rubric.” One way to really understand what AES programs do is to make this distinction: AES programs don’t score essays; they predict how humans would have scored an essay. The distinction is important conceptually for understanding what is happening, but in the end, if the system works, we get the same scores from computers as we would have from humans.

Automated Essay Scoring: Under the Hood

AES programs read the essays using very different approaches from how humans read the essays. The differences in the approach can be fairly shocking. For instance, the Shermis study notes that “One of the key challenges... was that carriage returns and paragraph formatting meta-tags were missing from the [essays].” That’s right: due to technical limitations (inputting essays as ASCII text), the AES programs didn’t know where kids put paragraph breaks in their essays. But it also didn’t matter, because computers don’t read like humans.

AES programs are bundles of hundreds of algorithms. For instance, one algorithm might be as follows:

Take all the words in the essays and stem them, so that “shooter,” “shooting,” and “shoot” are all the same word

Measure the frequency of co-location of all two-word pairs in the essay; in other words, generate a giant list of every stemmed word that appears adjacent to another stemmed word and the frequency of those pairings

For each new essay, compare the frequency of stemmed word pairings to the frequencies found in the training set.
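
For readers who like to see things in code, the three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any vendor's program actually works: the `stem` function is a crude suffix-stripper invented for this example (real systems would use a proper stemmer), and using cosine similarity to "compare" frequency profiles is an assumption, since the steps above don't say how the comparison is done. All function names and the tiny training set are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately crude suffix list; a real stemmer is far more careful.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def stem(word):
    """Strip one common suffix so 'shooter', 'shooting', 'shoot' collapse together."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bigram_counts(text):
    """Count every pair of adjacent stemmed words in a text."""
    words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    return Counter(zip(words, words[1:]))

def frequencies(counts):
    """Turn raw pair counts into relative frequencies."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

def cosine(a, b):
    """Cosine similarity between two sparse frequency profiles (an assumed measure)."""
    dot = sum(value * b.get(pair, 0.0) for pair, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_by_score(new_essay, training_set):
    """Compare a new essay's pair frequencies to the pooled essays at each human score."""
    pooled = defaultdict(Counter)
    for text, score in training_set:
        pooled[score].update(bigram_counts(text))
    new_profile = frequencies(bigram_counts(new_essay))
    return {score: cosine(new_profile, frequencies(counts))
            for score, counts in pooled.items()}

# Made-up training set of (essay text, human score) pairs, purely for illustration.
training = [("The shooter was shooting hoops", 4), ("Dogs bark loudly", 2)]
similarities = similarity_by_score("A shooter shoots hoops", training)
```

The result is one number per score level, which would be just one of the hundreds of measurements a real program might feed into its model.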

Does this sound ridiculous? Weird? Doesn’t matter. It doesn’t matter that the computer reads completely differently from a human, because it’s not trying to “read” the essay. It’s trying to compare the essay to other essays which have been scored by humans and faithfully replicate the scores. If a weird comparison improves the prediction of human grades, then it’s helping the machine faithfully replicate the scores that humans would have provided, even if it does so entirely differently from how humans would generate the scores.

So the AES programs will simultaneously and instantly run hundreds of these algorithms and put the results into a statistical model. (If you have some technical chops, you can read the User Manual for LightSIDE, the open source entry in the competition, and get a sense of what some of these algorithms are.) The output of the model is a score: again, not a score originally generated by the machine but a prediction of how a human would have scored the essay.
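
To make "put the results into a statistical model" a bit more concrete, here is a minimal sketch that assumes the model is a plain least-squares linear regression. That is an assumption made for illustration; the actual programs may use very different models. Each row of `feature_rows` stands for the outputs of a few algorithms run on one human-scored essay, and the function names and the 1-to-6 score range are hypothetical.

```python
def fit_linear_model(feature_rows, human_scores):
    """Least-squares fit of: score = w0 + w1*f1 + w2*f2 + ... (normal equations)."""
    rows = [[1.0] + list(features) for features in feature_rows]  # 1.0 = intercept term
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, human_scores))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        if abs(a[col][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit the model")
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict_score(weights, features, low=1, high=6):
    """Predict the human score for a new essay, rounded and clipped to the rubric range."""
    raw = weights[0] + sum(w * f for w, f in zip(weights[1:], features))
    return min(high, max(low, round(raw)))
```

The point of the sketch is the shape of the process: the human scores go in as training targets, and what comes out for a new essay is a prediction of the score a human would have given.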

Results from the AES Study: They Work

What Mark Shermis’ study demonstrates, pretty powerfully, is that these AES programs can do a remarkably good job of reliably predicting how humans would have scored a given essay. In the study, they took eight sets of essays which had all been graded by humans (from standardized tests in six states), and for each set, they gave the AES companies a sample of essays with the grades and a sample with the grades withheld. The vendors used the graded samples to train the programs, and then predicted the samples with the grades withheld. If you have some time, you can read the Shermis study to see exactly how well they did.

Because every essay set had different scoring protocols and procedures, they had to use some fancy statistics to compare results across the eight different essay sets, which makes parts of the paper hard to read. If you want one simple view of the data, check out Table 4, which shows how much the AES programs’ mean scores differed from the actual scores. You can also look at Figure 6, which is the best statistical comparison of the scores. The dark blue line represents the reliability between the two human graders and the final grade, and the other lines are the programs. Since all the lines are roughly overlapping, this shows that the AES machines are predicting scores with high levels of fidelity.
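
As a rough illustration of the simplest kind of comparison described above, the sketch below computes the difference in mean scores between machine predictions and withheld human grades, plus the share of essays where the two match exactly. These are not the study's own statistics, which are more involved; the function names are hypothetical, and any numbers you feed in are your own.

```python
from statistics import mean

def mean_score_difference(predicted, human):
    """Mean predicted score minus mean human score for one essay set."""
    if len(predicted) != len(human) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(predicted) - mean(human)

def exact_agreement(predicted, human):
    """Fraction of essays where the predicted score equals the human score."""
    if len(predicted) != len(human) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(1 for p, h in zip(predicted, human) if p == h)
    return matches / len(human)
```

A mean difference near zero says the machine is not systematically harsher or more generous than the humans; exact agreement says how often it lands on the same score essay by essay.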

Automated Essay Scoring in Review

So to review:

1) Automated Essay Scoring programs predict how humans would score an essay
2) They require a “training set” of essays scored by human raters, a sample of the full set of essays to be scored
3) They don’t “read” an essay like humans do; rather, they use hundreds of algorithms simultaneously to “faithfully replicate” human grading
4) They place no constraints on what kinds of writing can be scored (except poetry and some other classes of creative writing) or what kinds of questions can be asked. Technology has now reached the point where if humans can reliably evaluate the quality of an essay, then AES programs can reliably predict those quality ratings.

So digest that for today, and I’ll come back later with Part II: How this might change standardized assessment? and Part III: How this might change classroom teaching?

If you have questions about any of this, please leave them in the comments and I’ll try to address them either myself or by asking other experts. If you have some expertise in this area, and you feel like I’ve misstated something, please offer your corrections in the comments.

As always, follow me on Twitter at @bjfr, and for my own research papers and presentations, visit EdTechResearcher.org.

The opinions expressed in EdTech Researcher are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.
