The Dark Side of Big Data

A study published in the Nature journal Scientific Reports examined the phone records of some 1.5 million mobile phone users in an undisclosed small European country and found that just four data points, each giving the time and location of a call, were enough to uniquely identify 95% of the people. In the dataset, each individual’s location was recorded hourly, at a spatial resolution set by the carrier’s antennas.
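To make the idea of "four points identify 95%" concrete, here is a minimal Python sketch of how such a uniqueness figure could be estimated. It is not the study's code: the function name `unicity`, the representation of a trace as a set of (hour, antenna) pairs, and the toy data are all assumptions made for illustration.

```python
import random


def unicity(traces, k, trials_per_user=1, seed=0):
    """Estimate the share of random k-point samples that match exactly one trace.

    traces: dict mapping a user ID to a set of (hour, antenna) tuples.
    This layout is an assumption for illustration, not the study's format.
    """
    rng = random.Random(seed)
    unique = 0
    total = 0
    for points in traces.values():
        if len(points) < k:
            continue  # too few points to draw a k-point sample
        for _ in range(trials_per_user):
            # Draw k known points, as an outside observer might learn them.
            sample = set(rng.sample(sorted(points), k))
            # Count every trace in the dataset that contains all k points.
            matches = sum(1 for other in traces.values() if sample <= other)
            unique += matches == 1
            total += 1
    return unique / total if total else 0.0


# Invented toy data: three users, three hourly antenna sightings each.
traces = {
    "a": {(8, "A1"), (9, "A2"), (18, "A3")},
    "b": {(8, "A1"), (9, "A2"), (18, "A4")},
    "c": {(8, "A1"), (12, "A5"), (18, "A4")},
}

print(unicity(traces, k=3))  # 1.0: every full trace here is distinct
print(unicity(traces, k=1))  # depends on which single point is drawn
```

Even in this toy example, all three users share the same 8 a.m. antenna, yet knowing all three of a person's points singles them out every time. The study's finding is that in real mobility data this happens with very few points.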

Mobility data is among the most sensitive data currently being collected. It contains the approximate whereabouts of individuals and can be used to reconstruct their movements across space and time. A simply anonymized dataset contains no name, home address, phone number or other obvious identifier. For example, the Netflix Prize competition provided a training dataset of 100,480,507 movie ratings, each of the form <user, movie, date-of-grade, grade>, where the user was an integer ID.

Yet if an individual’s patterns are unique enough, outside information can be used to link the data back to that person. For instance, in one study, a medical database was successfully combined with a voter list to extract the health record of the governor of Massachusetts. In the case of the Netflix data set, despite the attempt to protect customer privacy, it proved possible to identify individual users by matching the data set against film ratings on the Internet Movie Database. Even coarse data sets provide little anonymity.
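The kind of linking described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the method used in either study: the function name `link_records`, the choice of ZIP code, birth date and sex as the shared fields, and every record below are assumptions or invented data.

```python
def link_records(anonymized, public, keys=("zip", "birth_date", "sex")):
    """Attach a name to each anonymized row whose key fields match exactly one public row.

    The key fields are assumed for illustration; any attributes present in
    both datasets could play the same role.
    """
    # Index the public list by its combination of key fields.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    linked = {}
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            linked[names[0]] = row
    return linked


# Invented public list (names included) and "anonymized" medical rows.
voters = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1950-01-01", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1960-02-02", "sex": "M"},
    {"name": "Bill Example", "zip": "02139", "birth_date": "1960-02-02", "sex": "M"},
]
medical = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1960-02-02", "sex": "M", "diagnosis": "flu"},
    {"zip": "02140", "birth_date": "1970-03-03", "sex": "F", "diagnosis": "migraine"},
]

for name, row in link_records(medical, voters).items():
    print(name, "->", row["diagnosis"])
```

Only the first medical row is linked here: the second matches two voters and the third matches none. The rarer a person's combination of attributes, the more likely the match is unambiguous.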

The challenge is making sure the debate over big data and privacy keeps pace with the science. Yves-Alexandre de Montjoye, one of the authors of the study, says that the ability to cross-link data, such as matching the identity of someone reading a news article to posts that person makes on Twitter, fundamentally changes the idea of privacy and anonymity.

Where do you, and by extension your political representative, stand on this 21st-century issue?