Resistance is futile

The Borg are a fictional alien race and one of the most terrifying antagonists in the Star Trek franchise. The phrase “Resistance is futile” is best delivered by Patrick Stewart in the episode The Best of Both Worlds.

When IBM demonstrated the power of Watson in 2011 by defeating two of the best humans ever to play Jeopardy, Ken Jennings, who had won 74 games in a row, admitted in defeat, “I, for one, welcome our new computer overlords.”

As the Edward Snowden revelations about the collection of phone call metadata became known, the first reaction was that it would be technically impossible to store data for every single phone call: the cost would be prohibitive. Then Brewster Kahle, one of the engineers behind the Internet Archive, made a spreadsheet to calculate the cost of recording and storing one year’s worth of all U.S. calls. He works the cost out to about $30M, which is non-trivial but by no means out of reach for a large U.S. government agency.
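The shape of that back-of-the-envelope calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reproduction of Kahle's spreadsheet: the function name and every input value below (call volume, call length, audio bitrate, cost per petabyte) are placeholder assumptions chosen to show how the estimate is built.

```python
def yearly_call_storage_cost(calls_per_day, avg_minutes_per_call,
                             kilobits_per_second, dollars_per_petabyte):
    """Return (petabytes, dollars) for one year of recorded call audio."""
    # Total seconds of audio recorded over 365 days.
    seconds_per_year = calls_per_day * 365 * avg_minutes_per_call * 60
    # kilobits/s -> bits/s -> bytes/s, then scale by total seconds.
    total_bytes = seconds_per_year * kilobits_per_second * 1000 / 8
    petabytes = total_bytes / 1e15
    return petabytes, petabytes * dollars_per_petabyte


# Placeholder inputs for illustration only, NOT taken from Kahle's spreadsheet.
petabytes, dollars = yearly_call_storage_cost(
    calls_per_day=3_000_000_000,
    avg_minutes_per_call=3,
    kilobits_per_second=8,
    dollars_per_petabyte=100_000,
)
print(f"{petabytes:.0f} PB, about ${dollars / 1e6:.0f}M")
```

The useful thing about writing it this way is that the result scales linearly with each input, so halving the bitrate or the price of disk halves the bill.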

The next thought was: OK, so maybe it’s technically feasible to record every phone call, but how could anyone possibly listen to every call? Obviously that is not possible, but could search terms be applied to locate “interesting” calls? Again, we didn’t think so, until another N.S.A. document, cited by The Guardian, showed a “global heat map” that appeared to represent how much data the N.S.A. sweeps up around the world. If metadata, the data about who is calling or e-mailing whom, can be mined efficiently, then the need for wiretapping and eavesdropping on the communications themselves becomes secondary.

This study in Nature shows that just four data points about the location and time of a mobile phone call make it possible to identify the caller 95 percent of the time.

IBM estimates that thanks to smartphones, tablets, social media sites, e-mail and other forms of digital communications, the world creates 2.5 quintillion bytes of new data daily. Searching through this volume of information is humanly impossible, but it is precisely what a Watson-like artificial intelligence is designed to do. Isn’t that exactly what was demonstrated in 2011 when Watson won Jeopardy?

The Dark Side of Big Data

The study published in Nature looked at the phone records of some 1.5 million mobile phone users in an undisclosed small European country, and found it took only four different data points on the time and location of a call to identify 95% of the people. In the dataset, the location of an individual was specified hourly, with a spatial resolution given by the carrier’s antennas.
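To make the idea of "four points identify 95% of people" concrete, here is a minimal sketch of how such a uniqueness figure could be estimated. It is not the authors' code or method: the toy traces, the function names and the sampling scheme are all assumptions made for illustration. Each trace is a set of (hour, antenna) points, and a sample of k points "identifies" its owner if no other trace contains all of them.

```python
import random


def is_unique(traces, user, points):
    """True if `user` is the only trace containing every point in `points`."""
    matches = [u for u, trace in traces.items() if points <= trace]
    return matches == [user]


def unicity(traces, k, trials=100, seed=0):
    """Estimate the fraction of random k-point samples that single out their owner."""
    rng = random.Random(seed)
    # Only users with at least k recorded points can be sampled.
    users = [u for u, trace in traces.items() if len(trace) >= k]
    hits = 0
    for _ in range(trials):
        user = rng.choice(users)
        points = set(rng.sample(sorted(traces[user]), k))
        hits += is_unique(traces, user, points)
    return hits / trials


# Toy traces: each point is (hour, antenna_id). Entirely made-up data.
traces = {
    "a": {(8, 1), (9, 2), (18, 3)},
    "b": {(8, 1), (9, 2), (18, 4)},
    "c": {(8, 1), (12, 5), (18, 4)},
}
print(unicity(traces, k=1), unicity(traces, k=3))
```

Even in this tiny example, a single point is often shared by several people, while a handful of points taken together quickly narrows the match to one trace.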

Mobility data is among the most sensitive data currently being collected. It contains the approximate whereabouts of individuals and can be used to reconstruct individuals’ movements across space and time. A simply anonymized dataset does not contain names, home addresses, phone numbers or other obvious identifiers. For example, the Netflix Challenge provided a training dataset of 100,480,507 movie ratings, each of the form <user, movie, date-of-grade, grade>, where the user was an integer ID.

Yet, if an individual’s patterns are unique enough, outside information can be used to link the data back to that individual. For instance, in one study, a medical database was successfully combined with a voter list to extract the health record of the governor of Massachusetts. In the case of the Netflix data set, despite the attempt to protect customer privacy, it proved possible to identify individual users by matching the data set with film ratings on the Internet Movie Database. Even coarse data sets provide little anonymity.
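A linkage attack of this kind is, at heart, a join between two tables on the fields they happen to share. The sketch below shows the mechanics under stated assumptions: the field names (zip, birth_date, sex), the rows and the function name are hypothetical and are not taken from the studies mentioned above.

```python
def link_records(anonymized, public, keys):
    """Re-identify rows of `anonymized` that match exactly one row of `public`."""
    # Index the public list by the shared fields.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    linked = []
    for record in anonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # unambiguous match: the record is re-identified
            linked.append({**record, "name": candidates[0]["name"]})
    return linked


# Made-up rows; the field names are hypothetical.
medical = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "X"},
    {"zip": "02139", "birth_date": "1980-05-05", "sex": "F", "diagnosis": "Y"},
]
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1980-05-05", "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_date": "1970-03-03", "sex": "F"},
]
for row in link_records(medical, voters, keys=("zip", "birth_date", "sex")):
    print(row["name"], row["diagnosis"])
```

Neither table is alarming on its own: one has no names and the other has no health data. The exposure comes only from the combination.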

The issue is making sure the debate over big data and privacy keeps up with the science. Yves-Alexandre de Montjoye, one of the authors of the Nature article, says that the ability to cross-link data, such as matching the identity of someone reading a news article to posts that person makes on Twitter, fundamentally changes the idea of privacy and anonymity.

Where do you, and by extension your political representative, stand on this 21st-century issue?