Theory and Petabyte Science

I recently came across an article in Wired, “The End of Theory:  Data Deluge Makes the Scientific Method Obsolete”.  To sum it up quickly:  the article argues that the volumes of data available to entities such as Google allow analysis of data without hypotheses, facilitating scientific progress without the need for theory.

I find the article troubling since someone will probably take it as plausible and a path to the future.  As a sociologist of the Internet, I appreciate the value of the vast troves of data now available.  The scale of data and the computational power to analyze it should propel science forward on all fronts more rapidly than we have previously experienced, but it cannot do so without thoughtful theory.

It is unclear whether the Wired article takes “correlation is enough” as a justification for massive data mining, or whether the author takes this progress to mean that forming hypotheses before data analysis is a thing of science past.  Having been involved in a number of research projects over the years, I believe the scientific method of deriving theory deductively, then collecting appropriate data and testing only those hypotheses, is in practice little more than a rhetorical device.  Given the costs of data collection, the first step is usually finding data that’s reasonably relevant to the broad research interests, followed by exploratory analysis, and finally framing the findings in the classic format of theory and hypothesis testing.

Inductive science works well in this fashion.  What the article seems to suggest, and may lead researchers to accept, is that theory itself can be neglected and only the statistically significant results from a rigorous data mining algorithm matter.  This sort of data mining would lead to bans on the public sale of ice cream (increased ice cream sales are highly correlated with incidents of rape) and immigration restrictions on the Mississippi Delta (the population of Louisiana and the area of the state are significantly negatively correlated).  Without theory and reasonable thought, the true causal mechanisms are left out of these models.

When it comes to the advertising revenue that businesses like Google are really interested in, this difference is inconsequential.  Tertiary variables that are merely correlated with the underlying causal mechanism can work well.  They’re the very foundation of stereotypes, racial profiling, biased polling, and direct mailing campaigns.  If we know someone supports the ACLU, they’re more likely than a random American to support the Sierra Club.  If we know an individual drives a Prius, they probably travel by airplane more often than by Greyhound.

When the comparison is “random”, such ‘statistically different from zero’ findings in exploratory data analysis are fine, and will make Google a lot of money.  When it comes to science, however, we still need theory to tell us that higher temperatures and increased gregarious behavior lead to both ice cream consumption and frequency of rape cases, rather than the two being directly related to one another.  We likewise need theory to tell us that, as time progresses, more people live in Louisiana (at least pre-Katrina) while the delta erodes into the Gulf of Mexico.  Without theory, science would suffer greatly, both from accepting coincidental correlations as causal and from overlooking true causal mechanisms because the right data was never analyzed.
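The ice cream example can be made concrete with a small simulation.  What follows is only a sketch under invented assumptions:  the variable names, coefficients, and noise levels are made up for illustration and are not drawn from any real data.  Temperature drives both series, and neither series affects the other.  The raw correlation should still come out clearly positive, while the correlation that remains after removing the temperature trend from each series should sit near zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fit on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


# Hypothetical data-generating process (all numbers are assumptions):
# temperature is the common cause; ice cream sales and incident counts
# each depend on temperature plus independent noise, never on each other.
rng = random.Random(0)
n = 5000
temperature = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
incidents = [0.5 * t + rng.gauss(0, 4) for t in temperature]

# What a theory-free data mining pass would report.
raw_r = pearson(ice_cream, incidents)

# What remains once the common cause is controlled for.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(incidents, temperature))

print(f"raw correlation:               {raw_r:.3f}")
print(f"controlling for temperature:   {partial_r:.3f}")
```

The catch is that the second number can only be computed by someone who already suspected temperature mattered and went out to measure it.  That suspicion is the theory; the data mining pass alone would never supply it.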

Correlations may be enough for marketing, but they are not sufficient for Science.