Friday, May 31, 2013

Data in Context

Garance Franke-Ruta clarifies the context surrounding the "discovered" data showing that former IRS Commissioner Douglas Shulman visited the White House 157 times during his tenure.  The short story is that the data behind that claim is imperfect.  A large majority of the supposed visits were in fact unfulfilled invites.  Another claim drawn from the same dataset is that he visited more often than any cabinet member.  That is another falsehood: the system referenced is used primarily to grant access to those walking into the complex, while cabinet members, given their seniority, are able to drive onto it.

This is yet another example of how important it is to leverage data thoughtfully.  Intelligence requires deliberate thought, not quick assertions and grandiose conclusions.  Minimal effort would have revealed the imperfections of the data referenced.  (The system used was built to track appointments within the White House complex, but only for meetings and "typical" events.  Access lists for larger events often forgo the use of this system, as do appointments involving more senior government officials cleared to drive into the complex.)

To avoid falling into this trap, there are three questions you must answer before acting on the information gleaned from a particular dataset:

  1. How was the data collected?
  2. What specific data is included in the dataset?
  3. And, most importantly, what specific data is NOT included in the dataset?

Only with such context can you begin to understand the information available...
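
As a rough illustration of the third question, here is a minimal Python sketch.  The records, the field names (`scheduled`, `arrived_at`), and the visitor labels are all made up for the example and are not the real visitor-log schema.  The point is only that counting rows and counting recorded arrivals can give very different answers, and that anyone who never passes through the system does not appear at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Appointment:
    """One row in a hypothetical appointment log (not the real schema)."""
    visitor: str
    scheduled: str
    arrived_at: Optional[str]  # None means no arrival was recorded


# Toy records, invented purely for illustration.
records = [
    Appointment("visitor_a", "2013-01-07", "2013-01-07 09:58"),
    Appointment("visitor_a", "2013-01-14", None),
    Appointment("visitor_a", "2013-01-21", None),
    Appointment("visitor_b", "2013-01-07", "2013-01-07 13:02"),
]


def count_appointments(rows, visitor):
    """Naive count: every row is treated as a visit."""
    return sum(1 for r in rows if r.visitor == visitor)


def count_recorded_arrivals(rows, visitor):
    """Count only rows where an arrival was actually recorded."""
    return sum(1 for r in rows if r.visitor == visitor and r.arrived_at is not None)


print("appointments:", count_appointments(records, "visitor_a"))
print("recorded arrivals:", count_recorded_arrivals(records, "visitor_a"))
# A visitor who enters by another route (for example, driving in)
# has no rows here, so any count for them would be zero.
print("unlogged visitor:", count_appointments(records, "visitor_c"))
```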

Wednesday, May 29, 2013

Tuesday, May 21, 2013

Information Efficiency vs. the Boogeyman

I hate when writers use the boogeyman to scare people.  Michael Carney does just that with his article on personal data, "You Are Your Data: the Scary Future of the Quantified Self Movement".

I don't deny that a small minority will "do evil" with the growing exposure of personal data.  My point is that someone of Michael's stature and position should not focus on what will undoubtedly be a small faction, at the expense of the larger, more bountiful majority.  The quantified self (and an exponentially increasing number of other datasets) is delivering value and will continue to do so, much of which we are only beginning to see.

From Michael:

"For those of us who don’t measure up compared to the rest of the population, the outcome won’t be pretty."

But what about those who are unnecessarily penalized, given today's information inefficiencies?  The truth is that the industries he cites become more efficient with more (personal) data.  Insurance is at its heart based on information: the more information available, the more effectively and efficiently risk can be priced.  The riskier clients pay more.  Market dynamics at work.

Health insurance policies, and even home mortgages, are quantified bets based on the information made available.  Yes, some people will have to pay more, but others will have to pay less.
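
A small Python sketch of that arithmetic, under deliberately simple assumptions: the four clients, their claim probabilities, and the claim size are invented, and the premium is just the expected loss with no margin or expenses.  Without individual information everyone pays the pooled average; with it, each client pays for their own risk.

```python
# Hypothetical book of four clients: annual probability of a claim.
clients = {
    "low_risk_a": 0.01,
    "low_risk_b": 0.01,
    "low_risk_c": 0.02,
    "high_risk": 0.10,
}
CLAIM_SIZE = 10_000  # assumed cost of a single claim


def pooled_premium(probabilities, claim_size):
    """With no individual data, every client pays the average expected loss."""
    average_probability = sum(probabilities.values()) / len(probabilities)
    return average_probability * claim_size


def risk_based_premiums(probabilities, claim_size):
    """With individual data, each client pays their own expected loss."""
    return {name: p * claim_size for name, p in probabilities.items()}


pooled = pooled_premium(clients, CLAIM_SIZE)
priced = risk_based_premiums(clients, CLAIM_SIZE)

for name, premium in priced.items():
    change = premium - pooled
    print(f"{name}: pooled {pooled:.2f}, risk-based {premium:.2f}, change {change:+.2f}")
```

In this toy book the total collected is the same either way; the information only changes who pays it.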

He finishes with an acknowledgement that he is not focused on the value.  Rather, he bases his argument on the need for user awareness.  I agree that privacy policies and terms of service documents need more transparency and less legalese.  Using the boogeyman to make the point is wrong.

Thursday, May 02, 2013

Calling Bullshit on Big Data

This article has a decent list of ways to call bullshit on data-driven analyses.  Click the link for context, but here are the top points:

  1. Focus on how robust a finding is, meaning that different ways of looking at the evidence point to the same conclusion. 
  2. Data mavens often make a big deal of their results being statistically significant, which is a statement that it’s unlikely their findings simply reflect chance. Don’t confuse this with something actually mattering. 
  3. Be wary of scholars using high-powered statistical techniques as a bludgeon to silence critics who are not specialists. 
  4. Don’t fall into the trap of thinking about an empirical finding as “right” or “wrong.” 
  5. Don’t mistake correlation for causation. 
  6. Always ask “so what?” 

As often occurs with an emerging technology theme, the glitz and glam of the shiny new thing that is big data often overshadow the real value.  The above list is a great start toward making sure that the data product or opportunity being pitched truly can add value to your mission.
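
To make #2 concrete, here is a minimal Python sketch using only the standard library.  The numbers are invented: two groups whose means differ by 0.1 on a scale with a standard deviation of 15, with a million observations each.  It uses a plain two-sample z-test that assumes a known, shared standard deviation, which is a simplification chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means.

    Assumes equal group sizes and a known, shared standard deviation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = (mean_b - mean_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def cohens_d(mean_a, mean_b, sd):
    """Effect size: the difference in means measured in standard deviations."""
    return (mean_b - mean_a) / sd


# Hypothetical inputs: a tiny difference measured on a huge sample.
z, p = two_sample_z_test(mean_a=100.0, mean_b=100.1, sd=15.0, n_per_group=1_000_000)
d = cohens_d(mean_a=100.0, mean_b=100.1, sd=15.0)

print(f"z = {z:.2f}, p = {p:.2g}, effect size d = {d:.4f}")
```

The p-value comes out far below the usual 0.05 threshold while the effect size stays under one hundredth of a standard deviation: statistically significant, and quite possibly irrelevant.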

#3 is an interesting one.  I see a trend in the emerging big data space: vendors and others seeking to exploit big data too often move to high-end, overly complex mathematics when more basic, easier-to-understand models would suffice.  This is especially true when building out new applications on top of large datasets.  You will often get to the productive answer faster by building simple prototypes before investing more expensive resources.  Data modeling is no different.

#5 above is a particularly important point.  My sense is that it is difficult for most people to logically separate the concepts of correlation and causation.  I find myself jumping too far too often, reading too much import into a basic correlation that lacks any evidence of causation.
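
A toy Python simulation shows how easily this happens.  Everything here is made up: a hidden factor (call it temperature) drives two otherwise unrelated quantities, and neither one causes the other.  The two still correlate strongly until the hidden factor is accounted for.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hidden common cause (hypothetical "temperature").
temperature = [rng.gauss(0, 1) for _ in range(n)]

# Two outcomes that each depend on temperature plus independent noise.
# Neither outcome influences the other.
ice_cream_sales = [t + rng.gauss(0, 1) for t in temperature]
sunburn_cases = [t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, sunburn_cases)

# Remove the contribution of the common cause. In this simulation it is
# known exactly; with real data it would have to be measured and estimated.
sales_residual = [s - t for s, t in zip(ice_cream_sales, temperature)]
sunburn_residual = [b - t for b, t in zip(sunburn_cases, temperature)]
adjusted = pearson(sales_residual, sunburn_residual)

print(f"raw correlation: {raw:.2f}")
print(f"after removing the common cause: {adjusted:.2f}")
```

The raw correlation is substantial, yet it shrinks to roughly zero once the shared driver is removed, which is exactly the case where inferring causation from the correlation would mislead.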

At the end of the day, high-end mathematics does not negate basic economic theory.  Be smart - don't forget your wits when digging into big data...