Grading Automated Essay Scoring Programs, Part II: Policy
How could machines that automatically grade essays lead to Deeper Learning? On the face of it, the premise sounds preposterous. But I'm increasingly convinced that there is a potentially valuable policy strategy here, and this post provides an overview. But first, a review of what Automated Essay Scoring programs are.
Review of Part I: Automated Essay Score Predictors
My interest in exploring this subject came from a recent study showing that Automated Essay Scoring programs had achieved the same level of reliability as human raters, and from a subsequent conversation with Will Richardson and Vantage Learning.
Part I in this series examined the question: How do Automated Essay Scoring Programs work? One way to frame the answer is this: "Automated Essay Scoring Programs" is a misleading name for what these tools do. It would be much better to call them "Automated Essay Score Predictors." AES programs are not very sophisticated at understanding the semantic (meaning-oriented) and syntactic (organization-oriented) elements of human writing. They are not capable of taking a piece of text and examining how that text fulfills the categories defined in a rubric. But if you give those machines access to examples of student writing that humans have already graded, they are incredibly good at predicting how humans would grade other essays. Basically, with some training, they are capable of predicting how humans would score an essay with a level of reliability that rivals the reliability between two humans scoring the same essay.
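The distinction is easiest to see in miniature. Here is a deliberately crude sketch (in Python, with invented toy data) of what "score prediction" means: extract shallow surface features from essays humans have already graded, fit weights by least squares, and use those weights to predict scores for new essays. Real AES systems use far richer features and more sophisticated models; this is only meant to show the shape of the pipeline, in which the program never "understands" an essay.

```python
import numpy as np

def surface_features(essay: str) -> np.ndarray:
    """Crude surface features of the general kind AES systems rely on:
    word count, average word length, sentence count. (Illustrative only;
    real systems extract many more features.)"""
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return np.array([
        len(words),
        sum(len(w) for w in words) / max(len(words), 1),
        len(sentences),
    ], dtype=float)

def fit_score_predictor(essays, human_scores):
    """Least-squares fit: learn weights mapping surface features to the
    scores humans already assigned to the training essays."""
    X = np.array([surface_features(e) for e in essays])
    X = np.hstack([X, np.ones((len(essays), 1))])  # intercept column
    w, *_ = np.linalg.lstsq(X, np.array(human_scores, dtype=float), rcond=None)
    return w

def predict_score(weights, essay):
    """Predict the score a human rater would likely assign."""
    x = np.append(surface_features(essay), 1.0)
    return float(x @ weights)
```

The point of the sketch: nothing in `fit_score_predictor` examines meaning or a rubric; it only finds statistical regularities between surface features and the grades humans already gave.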
I'm not sure how the AES vendors would respond to making this distinction between "scoring" and "score prediction" (if you are reading this, let me know in the comments or email me!). But I think this distinction is quite helpful in understanding what these machines can and cannot do.
How Automated Essay Score Predictors could Incentivize Deeper Learning
Last week, Barbara Chow, the director of the education program at the Hewlett Foundation explained to a meeting of grantees why the foundation was investing in research concerning Automated Essay Score Predictors as part of their strategy of expanding opportunities for Deeper Learning in schools. (Disclosure: I run a Hewlett-funded research project, and Hewlett has indirectly paid me a salary for four years, though Harvard is my direct employer. That said, when I had a chance to speak for 15 minutes at the Grantee meeting, I devoted the entire time to explaining how their Open Educational Resources grantmaking program could potentially be expanding educational inequalities. So there is some evidence that I try to call it as I see it.) Again, it's the kind of argument that raises eyebrows. "If we replace human essay raters with machines, students will have a richer learning experience." Oh, really?
First point: there are two consortia (PARCC and SBAC) developing new tests for the Common Core Standards. In 2014 or 2015, we're going to have some brand new tests in states all across the country. We have an opportunity to make them better. Here's how Barbara makes the case that Automated Essay Score Predictors can do that.
Here is an example of a test question from the AP US History test (2006 Released Exam):
Which of the following colonies required each community of 50 or more families to provide a teacher of reading and writing?
E. Rhode Island
Now, this is the kind of question that makes most educators go berserk. A student can have a deep, rich understanding of early American history and not know that factoid. So what if we could replace questions like that with questions like this (thanks to the College Board for sharing):
By the early twentieth century, the United States had emerged as a world power. Historians have proposed various dates for the beginning of this process, including the three listed below. Choose one of the three dates below or choose one of your own, and write a paragraph explaining why this date best marks the beginning of the United States' emergence as a world power. Write a second paragraph explaining why you did not choose the other dates. Support your argument with appropriate evidence.
- 1898 (Spanish-American War)
- 1917 (Entry into the First World War)
- 1941 (Entry into the Second World War)
I have some quibbles, but this is a much, much better question. It calls upon several skills broadly identified with deeper learning: solving an ill-structured problem (one without a single correct answer, requiring tacit knowledge) and communicating the answer in a persuasive, evidence-based argument.
Nearly everyone would agree that question 2 is better than question 1. Cost is the main reason we ask multiple-choice questions rather than having students write open responses: it's expensive to train, hire, and evaluate armies of raters to read student work. In theory, AES programs change the policy dynamic in three ways. First, it becomes much cheaper to score essays. Second, since you need fewer humans to score essays, you can train those humans better (this addresses the comments raised by @ceolaf in Part I); you can ask more complex questions because you have better-trained people doing more sophisticated rating. Third, since essays become cheaper to score, test designers can include more (and more sophisticated) essay questions and fewer multiple-choice questions.
So the Hewlett bet is that if you can use technology to allow the tests to ask more complex questions, we'll get more writing in schools, teachers will be forced to create richer learning environments to prepare students for more complex questions, and students will have more opportunities in schools for deeper learning.
Lots could go wrong here. Systems could keep the same tests, use AES programs, and spend every penny saved on the rising cost of healthcare. Test designers could write dumb essay questions or dumb rubrics. (But the technology does not force them to: AES programs can predict scores on sophisticated or source-based questions as well as they can on simpler ones. The limiting reagent is the capacity of humans to agree on scores with a nuanced rubric, not the technology itself. Similarly, students might be able to game a human scoring a rubric, but they can't game a program evaluating the frequency of co-located stemmed words. The key limitation in the system is human training, which is @ceolaf's point.)
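To make concrete why parroting rubric language doesn't fool these programs, here is a hypothetical sketch of one such shallow feature: the frequency of adjacent ("co-located") stemmed word pairs. The stemmer below is deliberately crude (production systems use something like the Porter stemmer); the point is that the feature counts statistical regularities of the whole text, not rubric keywords a student could sprinkle in.

```python
from collections import Counter
import re

def crude_stem(word: str) -> str:
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stemmed_bigram_counts(text: str) -> Counter:
    """Count co-located (adjacent) stemmed word pairs -- a shallow
    statistical feature of the kind a score predictor can tally."""
    tokens = [crude_stem(w) for w in re.findall(r"[a-z']+", text.lower())]
    return Counter(zip(tokens, tokens[1:]))
```

A student writing "started" and "starting" produces the same stem either way, so padding an essay with rubric vocabulary changes these counts far less than actually writing more, and more varied, prose does.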
But for all that, I think it's entirely plausible that by leveraging the power of computer programs to instantly and inexpensively predict essay scores, we can create more sophisticated assessments and have those better assessments drive better classroom practices. The worst case scenario that I envision is that even though we make kids do more writing, it's still stupid writing. In our current policy context, having kids write more is a downside I'm willing to risk.
This argument is not in conflict with the argument that we have too much testing, which I also believe. We should have less testing, and the testing we have should be better. It is worth experimenting with whether AES predictors can help with the latter issue.
One last point: I'm thrilled that CMU submitted an open-source score predictor in the contest, and doubly thrilled that it performs so well. I really hope that the consortia look very seriously at the tremendous advantages of using a scoring mechanism that scholars and data analysts around the world can collaborate on and improve collectively.
There is lots more to talk about here, and plenty to debate and discuss. I hope if you have questions or are enraged by something, you'll leave a note in the comments. In my next and last post in the series, I'll talk about how Automated Essay Score predictors could be used by teachers and students in classroom settings for teaching and learning about writing.