Big Data and Information Inequality

Mike Wu writing in Tech Crunch observed that in all realistic data sets (especially big data), the amount of information one can extract from the data is always much less than the data volume (see figure below): information << data.

Big Data

In his view, given the above, the value of big data is hugely exaggerated. He then goes on to infer that this is actually a strong argument for why we need even bigger data. Because the amount of valuable insights we can derive from big data is so very tiny, we need to collect even more data and use more powerful analytics to increase our chance of finding them.

Now machine data (aka log data) is certainly big data, and it is certainly true that obtaining insights from such dataset’s is a painstaking (and often thankless) job, but I wonder if this means we need even more data. Methinks we need to be able to better interpret the big data set and its relevance to “events”.

Over the past two years, we have been deeply involved in “eating our own dog food” as it were. At multiple EventTracker installations that are nationwide in scope, and span thousands of log sources, we have been working to extract insights for presentation to the network owners. In some cases, this is done with a lot of cooperation from the network owner and we have a good understanding of IT assets and the actors who use/abuse them. We find that with such involvement we are better able to risk prioritize what we observe in the data set and map to business concerns. In other cases where there is less interaction with the network owner and we know less about the actors or the relative criticality of assets, then we fall back on past experience and/or vendor-provided info as to what is an incident.  It is the same dataset in both cases but there is more value in one case than the other.

To say it another way, to get more information from the same data we need other types of context to extract signal from noise. Enabling logging at a more granular level from the same devices thereby generating an ever bigger dataset won’t increase the signal level. EventTracker can merge change audit data netflow information as well as vulnerability scan data to enable a greater signal-to-noise ratio. That is a big deal.