In 2011, Peter Huber wrote a book about what we (the collective we, or the we of data analysis enthusiasts) have learned about the science and art of data analysis over the past 50 years.

To anyone interested in data visualization, we highly recommend reading this book. Over the course of about 200 pages, Huber carefully outlines exactly what data analysis is, what the current challenges are, and how they can be overcome.

One of the aspects of this work that we find particularly interesting is that he lays out, in great detail, a road map or checklist of activities that should be completed, in a particular order, for any data analysis project to be successful and meaningful.

Allow us to examine Huber’s checklist and provide our own insights into how we can help.

The Data Analysis Checklist

Here are the nine steps to a successful data analysis of a data set of any size, as Huber lays them out.

  • Inspection
  • Error checking
  • Modification
  • Comparison
  • Modeling and Model fitting
  • Simulation
  • What-if analyses
  • Interpretation
  • Presentation of conclusions

Even a cursory look at this list raises some interesting questions, but anyone can see the logic in the steps.

Inspection “is quality control,” while error checking deals more with the quality of the data itself: whether it is complete and whether the analyst can read it in the first place. Modification, comparison, modeling and model fitting are all connected; together they ensure that the data is not overanalyzed and fits well with whichever statistical framework the analyst is working with.
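As a rough illustration, here is a minimal sketch of what the inspection and error-checking steps might look like in code. It assumes a tabular data set in a hypothetical file named customers.csv and uses pandas; it is not drawn from Huber's book or from any particular tool.

```python
import pandas as pd

# Inspection and error checking: a quick quality-control pass over a
# hypothetical customers.csv (file name and columns are assumptions).
df = pd.read_csv("customers.csv")

# Inspection: can we read the data, and does it look the way we expect?
print(df.shape)    # number of rows and columns
print(df.dtypes)   # are numeric columns actually numeric?
print(df.head())   # eyeball the first few records

# Error checking: completeness and obviously suspect values.
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.describe())          # ranges that might reveal outliers
```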

What is most interesting about this list, however, are the last few bullet points: simulation, what-if analyses, interpretation and presentation. In these steps the data is put through different models with different thresholds to see whether other results can be achieved, or a section of the data is intentionally left out to determine whether new patterns of causation arise.
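To make that concrete, here is a small, hedged sketch of one form such a simulation could take: re-running the same summary under a range of thresholds and comparing the outcomes. The file, column names and cutoff values are invented for illustration only.

```python
import pandas as pd

# Simulation by threshold: rerun the same analysis under several cutoffs
# and watch how the conclusion shifts (columns and cutoffs are examples).
df = pd.read_csv("customers.csv")

for threshold in (0.5, 0.7, 0.9):
    at_risk = df[df["churn_risk"] >= threshold]
    print(f"threshold={threshold}: {len(at_risk)} customers flagged, "
          f"average revenue {at_risk['revenue'].mean():.2f}")
```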

What is most troubling about these steps, however, is that according to Huber they “belong to the domain of the data scientist.”

Data Analysis Technology

Since Huber wrote this book in 2011, there has been an explosion in online tools that make it easier for someone without a data analysis background to move through his checklist quickly and professionally.

It is important to realize, however, that the technology available today has not progressed far enough to provide all of the skills of a data scientist within one tool. But with a little creative thinking, you will find that a combination of tools lets you reap the same benefits.

For example, a business user can call on programs from companies such as Information Builders or Pentaho Data Integration, which contain automated checks for data massaging and varying levels of “quality control,” as Huber puts it.

These tools, while revolutionary and easy to use, do not address the simulation, what-if analyses, interpretation or presentation of that same data. For that you need a tool that can apply precise, customized thresholds to the data and then let you group, sort and filter that information based on the status of any one element in order to perform the kinds of analysis Huber is describing.
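Outside of any particular product, the underlying idea is simple to sketch. The following hedged example shows one way a custom threshold could turn raw numbers into a status that you then group, sort and filter on; the file, column names and cutoffs are invented for illustration and are not VisualCue's internals.

```python
import pandas as pd

# Turn raw measurements into a status flag via custom thresholds, then
# group, sort and filter on that status (columns and cutoffs are examples).
df = pd.read_csv("customers.csv")

df["status"] = pd.cut(df["on_time_delivery_rate"],
                      bins=[0.0, 0.80, 0.95, 1.0],
                      labels=["red", "yellow", "green"])

# Group by status to see where the problems cluster.
print(df.groupby("status", observed=True)["revenue"].agg(["count", "mean"]))

# Filter down to the elements that need attention right now.
print(df[df["status"] == "red"].sort_values("revenue", ascending=False))
```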

Let us explain using VisualCue as an example.

[Screenshot: VisualCue mosaic view]

When describing “what-if analyses,” Huber states that such exercises include “omitting a certain subset of the data from analysis” and seeing what kind of difference that makes. In the traditional data analysis paradigm this was the domain of the data scientist, who would, often using complex equations and models, remove a customized subset of the data and then see how that changed the spreadsheet they were working with.
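In code, that kind of subset-omission experiment can be as small as the following sketch: fit the same simple trend with and without a chosen slice of the data and compare what comes out. The data set, column names and the choice of a linear fit are assumptions made for illustration, not Huber's method or VisualCue's.

```python
import pandas as pd
import numpy as np

# What-if by omission: fit the same linear trend with and without a chosen
# subset and compare the slope (file, columns and region are assumptions).
df = pd.read_csv("sales.csv")

def trend(data):
    # Slope of revenue versus marketing spend, via a degree-1 polynomial fit.
    slope, _ = np.polyfit(data["marketing_spend"], data["revenue"], 1)
    return slope

full = trend(df)
without_region = trend(df[df["region"] != "APAC"])

print(f"slope on full data:        {full:.3f}")
print(f"slope without APAC region: {without_region:.3f}")
```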

In VisualCue you can set up an A/B switch on any data set. This switch allows you to apply different filters to the same data set and then see, in VisualCue’s iconic visual language, exactly how that changes the data. What once took hours now takes minutes.

Further, Huber says that interpretation should go beyond the simple, quantifiable declaration of p-values and statistical analysis and into the realm of inference and logical conclusions based on the insights the data scientist discovered through simulation, modeling and so forth. Traditionally, this sort of inference took years of training and mountains of data to be sorted and sifted.

Technologies like VisualCue take complicated data sets and present them in a way that preserves all the detail of the original data without the complexity. Seeing data visually allows anyone, regardless of training, to make connections and spot new patterns that would have taken a data scientist weeks to uncover.

That’s the power of visual storytelling.

Until next time,

The VisualCrew