Interpreting i3 Scores: Good Luck
After announcing the 49 winning applicants for the $650 million Investing in Innovation competition, the U.S. Department of Education has now put online the scores, judges' comments, and more details about each project.
Trying to make sense of the numerical scores for the validation and development award winners is, at least for this blogger, an exercise in futility. And it's all because of a statistical process called "standardization."
For me, the quest to understand the i3 scoring system began with this question: Did Saint Vrain School District really have the best application of them all?
Because of the large number of applicants and judges in the validation and development categories, the department used standardization to make sure the scoring was done as fairly as possible. (For more about this, read the department's explanation on page 2 of this FAQ document.) It's important to note that for the 19 scale-up applicants, standardization was not used, and their scores are their raw scores. Also, for Race to the Top, raw scores were used and no adjustments were made for scoring anomalies, which did spark some questions.
As a case study in standardization—and in how meaningless the i3 judges' actual raw scores really are—let's look at Saint Vrain School District's score sheet. This caught my attention because not only did this Colorado school district win a development grant, it had the highest score of all applicants, at 116.95. (That's on a grading scale of 105; more on that later.)
This application and other winning ones had five judges: three subject-matter experts who zeroed in on the proposal and two research experts who evaluated the evidence presented. All attended mandatory department training sessions.
In looking at the judges' scores, you'd have no idea this was a winning application, much less the highest-rated of them all. And the two who judged the evidence components seemed to raise legitimate questions about the research behind the district's proposal, and how the district planned to evaluate its new program to help English-language learners by focusing on STEM (science, technology, engineering, and math) subjects. (To be clear, this is not a critique of Saint Vrain, but more an exploration of the i3 score standardization process.)
One subject-matter judge gave Saint Vrain just 42 out of 80 points possible. Another one gave it 50 out of 80. And one judge gave it 75 out of 80. On the evidence side, the district got 10 of 25 points from one judge and 13 out of 25 points from another.
Other winning applications got higher, more consistent scores. And my guess is if you looked at the raw scores for applications that didn't win, there would be some that were better than Saint Vrain's.
So I asked the department last night: How in the world, with not-so-great raw scores like that, did Saint Vrain win and get the highest score?
Department officials told me it all boils down to standardization, which seeks to balance out hard graders and easy graders across applications so that an applicant isn't penalized for getting assigned hard graders. For example, the two subject-matter judges who gave Saint Vrain those low scores must have been very, very hard graders, as evidenced by the grades they gave on other applications as well. The same thing must have been true for the research-expert judges, who must also have been very tough graders.
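For readers who want to see the idea in miniature, here is a rough Python sketch of one common way to standardize by judge: convert each judge's raw scores to z-scores relative to that judge's own average and spread. This is an assumption about the general technique, not the department's actual formula, and the judge names, application labels, and scores are hypothetical.

```python
from statistics import mean, pstdev


def standardize_by_judge(raw_scores):
    """Convert each judge's raw scores to z-scores within that judge.

    raw_scores maps judge -> {application: raw score}.
    Assumed method for illustration; not the department's published formula.
    """
    standardized = {}
    for judge, scores in raw_scores.items():
        mu = mean(scores.values())
        sigma = pstdev(scores.values())
        standardized[judge] = {
            # A judge who gives every application the same score has no spread,
            # so every z-score is set to 0 in that case.
            app: (score - mu) / sigma if sigma else 0.0
            for app, score in scores.items()
        }
    return standardized


# Hypothetical data: one hard grader and one easy grader, same three applications.
raw = {
    "tough_judge": {"A": 42, "B": 30, "C": 36},
    "easy_judge": {"A": 75, "B": 78, "C": 72},
}

z = standardize_by_judge(raw)
for judge, scores in z.items():
    print(judge, {app: round(value, 2) for app, value in scores.items()})
```

In this made-up example, a 42 from the tough judge lands well above that judge's average, while a 75 from the easy judge is exactly average for that judge. Under this kind of adjustment, the lower raw number counts for more.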
But as one Politics K-12 reader who was also questioning Saint Vrain's standardized score pointed out, how likely is it that a truly strong proposal would be marked that low by anyone, let alone two subject-matter judges and two research judges?
So, at least from my point of view, this makes the scores you see on the score sheets relatively meaningless. And even the comments may not be as meaningful as they could be because some of the tough graders were judged, at least by some statistical program, to be too tough, while other graders were judged to be too easy.
Moreover, it's also confusing that the final scores listed by the department go above 105, which is the maximum number of points available on the competition's grading scale. Again, blame standardization.
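A small Python sketch shows how an adjusted score can overshoot the top of the scale. It assumes that a standardized score is mapped back onto the point scale using an overall average and spread; the department's exact method may differ, and every number here except the 105-point maximum is hypothetical.

```python
MAX_POINTS = 105  # maximum on the competition's grading scale


def rescale(z_score, overall_mean, overall_sd):
    """Map a standardized score back onto the point scale.

    Assumed formula for illustration only: nothing caps the result at MAX_POINTS.
    """
    return overall_mean + z_score * overall_sd


# Hypothetical inputs: an application rated 2.5 standard deviations above
# average, in a pool whose overall mean is 70 points with a spread of 20.
final_score = rescale(2.5, overall_mean=70.0, overall_sd=20.0)

print(final_score)               # 120.0
print(final_score > MAX_POINTS)  # True
```

Because the adjustment is relative to how other applications were scored, an application rated far above its judges' norms can end up with a final number that no judge could have awarded directly.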
I am by no means a statistical expert, so please weigh in on this issue in the comments section below. I'm especially interested in other noteworthy things you've spotted on the i3 score sheets.