Privacy, Anonymity, and Big Data in the Social Sciences
You can have anonymous data or you can have open science, but you can't have both.
That's the conclusion that several colleagues and I reach in an article now online at Queue and forthcoming in Communications of the Association for Computing Machinery.
The short version: many people have called for making science more open and transparent by sharing data and posting it openly. This allows researchers to check each other's work and to aggregate smaller datasets into larger ones. One saying that I'm fond of is: "the best use of your dataset is something that someone else will come up with." The problem is that increasingly, all of this data is about us. In education, it's about our demographics, our learning behavior, and our performance. Across the social sciences, it's about our health, our beliefs, and our social connections. Sharing and merging data adds to the risk of disclosing it.
The article shares a case study of our efforts to strike a balance between anonymity and open science by de-identifying a dataset of learner data from HarvardX and releasing it to the public. In order to de-identify the data to a standard that we thought was reasonably resistant to reidentification efforts, we had to delete some records and blur some variables. If a learner's combination of identifying variables was too unique, we either deleted the record or scrubbed the data to make it look less unique. The result was suitable for release (in our view), but as we looked more closely at the released dataset, it wasn't suitable for science. We scrubbed the data to the point where it was problematically dissimilar from the original dataset. If you do research using our data, you can't be sure if your findings are legitimate or an artifact of de-identification.
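To make the delete-or-blur idea concrete, here is a minimal sketch in Python. It is not the procedure we actually ran on the HarvardX data: the function names, field names, bucket width, and the four toy records are all invented for illustration. It blurs birth years into five-year buckets, then deletes any record whose combination of identifying variables appears fewer than k times.

```python
from collections import Counter


def blur_year(year, width=5):
    """Coarsen a birth year into a bucket label, e.g. 1987 -> '1985-1989'."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def suppress_rare_records(records, quasi_identifiers, k):
    """Keep only records whose quasi-identifier combination occurs at least k times."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


# Toy records, invented for illustration only.
learners = [
    {"country": "US", "birth_year": 1987, "grade": 0.91},
    {"country": "US", "birth_year": 1986, "grade": 0.40},
    {"country": "US", "birth_year": 1988, "grade": 0.75},
    {"country": "IS", "birth_year": 1951, "grade": 0.99},
]

# Step 1: blur a variable so that more records share the same combination.
blurred = [{**r, "birth_year": blur_year(r["birth_year"])} for r in learners]

# Step 2: delete records that are still too unique (hypothetical k = 3).
released = suppress_rare_records(blurred, ["country", "birth_year"], k=3)

original_mean = sum(r["grade"] for r in learners) / len(learners)
released_mean = sum(r["grade"] for r in released) / len(released)
```

In this toy case the lone learner from Iceland is dropped, and that learner also had the highest grade, so released_mean comes out lower than original_mean. That is the same kind of distortion, in miniature, that made our released dataset hard to trust for research.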
This was a powerful revelation for many of us, especially in the face of evidence that the weapons of re-identification, in the long run, will probably outpace the shields of de-identification. We all increasingly share so much about ourselves, and ultimately the datasets created outside learning platforms can be merged with datasets from learning platforms to re-identify people. It may simply not be possible to do science with anonymized data, in education or anywhere in the social sciences.
Right now, we conflate privacy with anonymity, though we need not. The Federalist Papers were anonymous but not private. Voting is private but not anonymous. If we are going to have open science with human subjects data, we'll need to explore new approaches to balancing open science and privacy. We conclude our essay:
This example of our efforts to de-identify a simple set of student data--a tiny fraction of the granular event logs available from the edX platform--reveals a conflict between open data, the replicability of results, and the potential for novel analyses on one hand, and the anonymity of research subjects on the other. This tension extends beyond MOOC data to much of social science data, but the challenge is acute in educational research because FERPA conflates anonymity--and therefore de-identification--with privacy. One conclusion could be that the data is too sensitive to share; so if de-identification has too large an impact on the integrity of a data set, then the data should not be shared. We believe that this is an undesirable position, because the few researchers privileged enough to have access to the data would then be working in a bubble where few of their peers have the ability to challenge or augment their findings. Such limits would, at best, slow down the advancement of knowledge. At worst, these limits would prevent groundbreaking research from ever being conducted.
Neither abandoning open data nor loosening student privacy protections is a wise option. Rather, the research community should vigorously pursue technology and policy solutions to the tension between open data and privacy. A promising technological solution is differential privacy [3]. Under the framework of differential privacy, the original data is maintained, but raw PII is not accessed by the researcher. Instead, it resides in a secure database that has the ability to answer questions about the data. A researcher can submit a model--a regression equation, for example--to the database, and the regression coefficients and R-squared are returned. Differential privacy has challenges of its own, and remains an open research question because implementing such a system would require carefully crafting limits around the number and specificity of questions that can be asked in order to prevent identification of subjects. For example, no answer could be returned if it drew upon fewer than k rows, where k is the same minimum cell size used in k-anonymity.
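The minimum-row rule mentioned at the end of that passage can be sketched in a few lines of Python. This is only an illustration of a query interface that refuses answers drawn from fewer than k rows; the class name, method name, field names, and toy records are hypothetical. It is not a differential privacy implementation: formal differential privacy also adds calibrated random noise to each answer and limits how many questions can be asked, and this sketch does neither.

```python
from statistics import mean


class GuardedDataset:
    """Holds raw rows privately and answers only aggregate questions."""

    def __init__(self, rows, k):
        self._rows = list(rows)  # raw records never leave this object
        self._k = k              # minimum number of rows behind any answer

    def mean_of(self, field, where=lambda row: True):
        """Return the mean of `field` over matching rows, or None if fewer than k match."""
        matched = [row[field] for row in self._rows if where(row)]
        if len(matched) < self._k:
            return None  # refuse: the answer would describe too few people
        return mean(matched)


# Toy records, invented for illustration only.
db = GuardedDataset(
    [
        {"country": "US", "grade": 0.91},
        {"country": "US", "grade": 0.40},
        {"country": "US", "grade": 0.75},
        {"country": "IS", "grade": 0.99},
    ],
    k=3,
)

overall = db.mean_of("grade")  # answered: draws on four rows
iceland = db.mean_of("grade", where=lambda r: r["country"] == "IS")  # refused: one row
```

Even this simple guard shows why the design is hard. A researcher could ask for the overall mean and then the mean for US learners only, and recover the Icelandic learner's grade by subtraction, which is why limits on the number and specificity of questions matter as much as the threshold itself.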
We propose that privacy can be upheld by researchers bound to an ethical and legal framework, even if these researchers can identify individuals and all of their actions. If we want to have high-quality social science research and privacy of human subjects, we must eventually have trust in researchers. Otherwise, we'll always have a strict tradeoff between anonymity and science.
If we must have trust in researchers to enable open science, then researchers will need to earn that trust.