Computers Can Assess What Computers Do Best
This is the third of three posts about blended learning, inspired by a recent Center for Education Policy Research conference at the Harvard Graduate School of Education that brought together researchers, software developers, funders, and educators to explore research frontiers in blended learning. The previous two posts addressed educator frustration with data dashboards and excitement about embedding social-psychology-based interventions in technology platforms.
In this post, I want to tackle what I think is the thorniest, most important problem within this system: the limitations of our current assessment technologies. The core problem is this: computers are mainly good at assessing the kinds of things we don't need humans to do anymore. But let's start at the beginning...
Assessment and Blended Learning
The basic model of personalized, blended learning as implemented in schools like Rocketship is as follows: For part of the school day, a kid sits at a computer. The computer teaches the kid stuff. The computer tests the kid on stuff. If the kid gets stuff right, she moves on to new topics. If not, the computer teaches the kid the same stuff, or sends data to the teacher about how to teach the kid that stuff (see previous post on how hard it is for computers to generate actionable data).
(As an aside, the places where this work happens sometimes look pretty funny. This picture from Rocketship looks like a call center that was decorated by the Easter Bunny.)
The lynchpin of this system is the computational assessment. The whole promise of computer-aided instruction is that each kid gets the stuff she needs, because the computer can quickly figure out what she knows and what she doesn't. The kid can get fed more stuff more quickly because computers can assess students instantly and constantly; a teacher simply doesn't have the time to figure out what stuff 30 elementary school students or 140 secondary school students need on a daily, hourly, minutely basis.
What Can Computers Assess?
So if computational assessment is the lynchpin of the system, then we need to ask: What human competencies can computers accurately assess? Basically, they can do a few things on their own, and one thing with some help. Multiple-choice questions: no problem. They can also evaluate quantitative questions with a single right answer that can be input with a keyboard (4, 2x+2, e^7, etc.). They can evaluate computer code quite well: whether it works, how quickly it works, where it broke, whether it conforms to certain design specifications, and so on. And they are getting better at recognizing human speech and pronunciation, with some neat applications for language learning.
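To make the "single right answer" case concrete, here is a minimal sketch (my own illustration, not any particular platform's implementation) of checking a typed math answer against a reference expression. It tests numerical agreement at random sample points; real systems typically use a symbolic algebra engine, and the eval-based parsing here is for illustration only:

```python
import math
import random

def answers_match(student_expr, reference_expr, trials=20, tol=1e-6):
    """Numerically test whether two single-variable expressions agree
    by evaluating both at random sample points.

    A toy stand-in for the symbolic checkers real platforms use.
    Never eval untrusted input in production code.
    """
    for _ in range(trials):
        env = {"x": random.uniform(-10, 10), "e": math.e}
        try:
            a = eval(student_expr, {"__builtins__": {}}, env)
            b = eval(reference_expr, {"__builtins__": {}}, env)
        except Exception:
            return False  # unparseable input counts as wrong
        if abs(a - b) > tol * max(1.0, abs(b)):
            return False
    return True

print(answers_match("2*x + 2", "2*(x + 1)"))  # equivalent forms -> True
print(answers_match("2*x + 2", "2*x + 3"))    # off by one -> False
```

The point of the sketch is how cheap this kind of grading is: the computer can run it instantly, for every student, on every attempt.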
With some help, computers can also sort of evaluate human writing of about 400 words and up. They can't really evaluate writing very well on their own, but if you take a sub-sample of at least a hundred essays and have humans grade them, then computers can predict the scores for the rest of the essays with about the same reliability as the humans who grade essays for standardized tests. (They don't do as well with short documents, which don't have enough data for algorithms to classify reliably.)
Overall, computers (without human training) are good at assessing the kinds of things--quantitative things, computational things--that computers are good at doing. Which is to say that they are good at assessing things that we no longer need humans to do.
Take math, for instance. The Common Core State Standards describe six parts of mathematical modeling: 1) finding the problem, 2) representing the problem in equations, tables, or graphs, 3) calculating answers, 4) interpreting results, 5) validating conclusions, and 6) explaining reasoning.
In the real world, computers have a lock on #3 and humans have the edge in the other five. However, the only one of these that computers can reliably assess is #3 (unless we fudge and use multiple choice). Computers can't tell us if a human has designed a good representation for a problem or provided a careful, reasoned defense of a mathematical approach to an issue.
If we specify an equation to solve, computers can check to see if a human has solved it correctly. That's because the computer is good at solving that sort of thing. And the sorts of things computers are good at solving, we don't really need humans to do any more.
Now, if a school's main goal is to use software to program children to take tests like PARCC and Smarter Balanced, this isn't such a big deal. The PARCC and Smarter Balanced tests will be constrained by these same dynamics, and they will be limited to the same type of computationally-assessable problems (except possibly some problems that might be farmed out to humans to grade). We can probably design software that programs children to take these tests. However, from the standpoint of preparing humans for life in the real world, this is a real problem at the heart of blended learning. Software may prove best at preparing students for things that humans no longer need to do.
Future Directions in Computational Assessment Research
I'm quite interested in these assessment problems, especially for my work with HarvardX, and I'm interested in four lines of inquiry that might lead to better assessments in personalized online environments.
First, one of the fastest-developing alternatives to machine assessment is peer assessment. Can we get students to evaluate each other? Computers are very good at taking care of the logistics, so dozens or hundreds or thousands of students can evaluate each other's work. And it may be that if you get five or six peers to evaluate another student, the average of their assessments will be pretty close to a teacher's assessment. This might not be politically feasible in high-stakes situations (we don't want little Johnny involved in high-stakes assessment of little Sally), but for various kinds of formative assessment, peer assessment tools might help us expand the range of the assessable.
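The aggregation step is the easy part for the computer. Here is a minimal sketch, assuming five peer scores on a hypothetical 1-5 scale; the trimmed mean shown is one common guard against a careless or strategic grader, not a claim about how any particular platform does it:

```python
def aggregate_peer_scores(scores):
    """Drop the single highest and single lowest peer score,
    then average the rest (a simple trimmed mean)."""
    if len(scores) < 3:
        return sum(scores) / len(scores)
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

peer_scores = [4, 5, 4, 1, 5]  # one outlier peer graded harshly
consensus = aggregate_peer_scores(peer_scores)  # 4.33..., outlier discarded
```

The hard parts are human, not computational: getting peers to grade carefully, and deciding when a peer-derived score is trustworthy enough to use.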
Second, there are folks at HarvardX and elsewhere working on annotation tools: tools that allow learners to annotate and comment upon selections of text. Humanists have been using annotation and marginalia to develop and demonstrate their understanding for millennia, so this seems like a worthy exploration. If we ask students to explain their reasoning as they interpret a text, can we measure their ability to identify and comment upon relevant selections of text?
Third, many people are interested in how game mechanics might be used in assessment. Virtual simulations can present people with complex, ill-structured problems and then observe how they approach those problems. In these assessment models, we might be able to evaluate both a person's solution and the process she follows to get to it. In many of these tools, however, the final step tends to be having students do some writing that looks an awful lot like a document-based question or other essay format, which brings us back to all of the problems with assessing essays and other unstructured text.
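One way to imagine scoring process rather than product: log a student's actions inside a simulation and credit required steps completed in order. Everything in this sketch is hypothetical, including the step names and the scoring rule:

```python
# Hypothetical required sequence for a simulated inquiry task.
REQUIRED_ORDER = ["inspect_data", "form_hypothesis", "run_test", "conclude"]

def process_score(action_log):
    """Return the fraction of required steps the student completed
    in the required order, ignoring unrelated actions in between."""
    position = 0
    for action in action_log:
        if position < len(REQUIRED_ORDER) and action == REQUIRED_ORDER[position]:
            position += 1
    return position / len(REQUIRED_ORDER)

log = ["inspect_data", "chat", "form_hypothesis", "run_test"]
score = process_score(log)  # 3 of 4 required steps in order -> 0.75
```

Even this tiny example shows the appeal: the computer can assess how a student worked, not just what she answered, as long as the "process" can be captured as structured events rather than free-form writing.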
As a final category, I think Sam Wineburg's work with Beyond the Bubble is a worthy model of domain-specific, tractable, nuanced assessment questions. Beyond the Bubble asks students to perform specific applications of historical thinking skills with primary source documents. That model wasn't developed with computational assessment in mind (and it would face challenges with the short length of student answers), but it would be interesting to try to get computer assessments to work in those domains.
This assessment question is really the lynchpin of blended learning models, and of other models that depend upon software to assess and teach students. As long as we face these limits on what computer tools can assess, we'll face serious limits on the domains where computers can supplement or replace human teachers. The domains where computer assessment falls short may prove to be the most important domains for student learning.