MITx and HarvardX Release De-Identified Dataset from First Year of MOOCs
I'm pleased to announce today that my colleagues at HarvardX and MITx have released a de-identified person-course dataset from the 16 courses in the first year of edX; this is the same dataset used to produce *HarvardX and MITx: The First Year of Open Online Courses*. This massive effort was led by Jon Daries of the MIT Office of Institutional Research. We hope this dataset signifies our commitment to providing the research community with data that advances the science of learning while protecting student privacy. We're excited to see what folks do with it.
Here's some background on what is in the dataset, and what we did with it.
A "person-course" dataset means that every row in the dataset corresponds to one registration in one course. A student who registered for Intro to Computer Science from HarvardX and Biology from MITx would have two rows in the dataset. The columns of the dataset are variables like age, gender, grade, whether the registrant earned a certificate, the number of days they were active in a course, and so forth.
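As a hypothetical illustration (the column names and values here are invented; the released dataset's actual schema may differ), that student's two rows might look like this:

```python
# Hypothetical person-course rows: one row per (student, course) registration.
# All field names and values are illustrative, not the release's real schema.
person_course = [
    {"user_id": "a1b2c3", "course": "HarvardX/CS50x", "age": 24,
     "gender": "f", "grade": 0.9, "certified": True, "ndays_act": 41},
    {"user_id": "a1b2c3", "course": "MITx/7.00x", "age": 24,
     "gender": "f", "grade": 0.2, "certified": False, "ndays_act": 7},
]

# The same student appears once per course registration.
rows_for_student = [r for r in person_course if r["user_id"] == "a1b2c3"]
print(len(rows_for_student))  # 2
```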
De-identified means two kinds of things. First, we removed all of the obvious fields that would let a person identify an individual student, like names and email addresses. We also hashed (scrambled and replaced with an arbitrary string of characters) values like the user_id, so the dataset cannot be linked to other datasets, but each student's rows can still be connected to one another within the dataset.
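One common way to do this kind of hashing (a sketch of the general technique, not necessarily the exact method our team used) is a keyed cryptographic hash: with a secret salt, the same user_id always maps to the same opaque string within the release, but an id found in some other dataset cannot be hashed to match it without knowing the salt.

```python
import hashlib
import hmac

# Secret salt known only to the data curators; the value here is
# purely illustrative. Without it, a user_id from another dataset
# cannot be recomputed to match the pseudonyms in the release.
SECRET_SALT = b"keep-this-private"

def pseudonymize(user_id: str) -> str:
    """Map a user_id to a stable, opaque identifier."""
    return hmac.new(SECRET_SALT, user_id.encode(), hashlib.sha256).hexdigest()[:12]

# The same student always gets the same pseudonym, so their rows can
# still be grouped within the dataset...
assert pseudonymize("student42") == pseudonymize("student42")
# ...but different students get different pseudonyms.
assert pseudonymize("student42") != pseudonymize("student43")
```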
Next, we looked at anyone with a unique combination of variables, and we either modified those variables or removed rows from the dataset until they were no longer unique. Let's say you are the only person in Biology from Bulgaria, and you introduce yourself as such in the forums. If we identify your country of origin as Bulgaria, people can go to the forums and figure out more things about you. So for all countries with fewer than 5,000 people in the dataset, we list only their general region of the world.
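This step is called generalization: a too-specific value is replaced with a coarser one. A minimal sketch of the country-to-region rule (the mapping table and field names are invented for illustration):

```python
from collections import Counter

# Illustrative country-to-region lookup; the real release presumably
# used a complete mapping of countries to world regions.
REGION = {"Bulgaria": "Europe", "France": "Europe",
          "Tobago": "Caribbean", "USA": "North America"}

def generalize_countries(rows, threshold=5000):
    """Replace country with its world region wherever that country has
    fewer than `threshold` registrants in the dataset."""
    counts = Counter(r["country"] for r in rows)
    out = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if counts[r["country"]] < threshold:
            r["country"] = REGION.get(r["country"], "Other")
        out.append(r)
    return out

rows = [{"country": "Bulgaria"}] + [{"country": "USA"}] * 6000
generalized = generalize_countries(rows)
print(generalized[0]["country"])  # "Europe": too few Bulgarians to list
print(generalized[1]["country"])  # "USA": common enough to keep as-is
```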
There are also people with extreme behavior or unusually high activity. For instance, most people never post in the forums, but a few post often: 847 times, or 93 times. If someone scraped the forums and counted everyone's posts, they could identify these people. Rather than modify their data, we deleted these rows. *This is important.* To protect anonymity, very active registrants are deleted from the data.
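This second technique is called suppression: the outlier's row is dropped entirely rather than edited. A minimal sketch (the column name and cutoff are hypothetical; the real cutoffs would be chosen from the data):

```python
def suppress_outliers(rows, column, threshold):
    """Drop every row whose value in `column` exceeds `threshold`.
    The threshold of 50 used below is purely illustrative."""
    return [r for r in rows if r[column] <= threshold]

rows = [{"nforum_posts": 2}, {"nforum_posts": 0}, {"nforum_posts": 847}]
kept = suppress_outliers(rows, "nforum_posts", threshold=50)
print(len(kept))  # 2: the 847-post registrant is removed entirely
```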
There are also people with unique combinations of values in the dataset. We've set things up so that no person can be distinguished from at least four other people in the dataset (in technical terms, we maintain k-anonymity with k=5). For instance, lots of people sign up for unique combinations of courses. If you register for six courses, you might be the only person with that particular set. So we start dropping your rows from the dataset (using crazy complicated algorithms to minimize data loss and prevent bias against big courses) until you look like at least four other people.
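The end condition above can be sketched with a naive version of the idea: count how often each combination of identifying variables occurs, and suppress rows whose combination occurs fewer than k times. This toy version is far simpler than the actual algorithms, which minimized data loss and bias; it only shows what k-anonymity means.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values appears
    at least k times in the dataset."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(count >= k for count in combos.values())

def enforce_k_anonymity(rows, quasi_identifiers, k=5):
    """Naive suppression: drop every row whose quasi-identifier
    combination appears fewer than k times. Illustrative only."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return [r for r in rows
            if combos[tuple(r[q] for q in quasi_identifiers)] >= k]

# Six registrants share one course combination; one registrant is unique.
rows = [{"courses": "CS50x"}] * 6 + [{"courses": "CS50x+7.00x+JusticeX"}]
safe = enforce_k_anonymity(rows, ["courses"], k=5)
print(is_k_anonymous(safe, ["courses"]))  # True
print(len(safe))  # 6: the unique combination was suppressed
```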
Finally, we got a smart group of Harvard computer science students and had them try to break the dataset and re-identify people. We feel pretty good about where we are right now.
It's always possible that people could re-identify students; for instance, if someone scraped all of the edX forums, along with data from social media like Facebook and Twitter, where people say things like "OMG! I'm from Tobago and I signed up for Biology, Justice, and Heroes, and I'm going to post in the forums in each course exactly 6 times." We think we've minimized these risks, but we know they are non-zero.
Our original person-course dataset had about 840,000 rows, and the released dataset has about 740,000. In the original dataset, 5% of registrants earned a certificate in a course; in the released dataset, 3% did. So de-identification has a non-trivial impact on the composition of the dataset, especially with regard to our most active users. It will be interesting to see people reproduce our results, and the kinds of discrepancies that come about.
Our team has tried to balance making data accessible to other researchers with protecting the privacy of our users. I hope we've done that well, but we look forward to hearing feedback from researchers, privacy advocates, and others.