Data mining involves ransacking data and searching for patterns, without being motivated by theories, common sense, or wisdom. The inescapable problem is that patterns can always be found, even in random numbers, so finding a pattern proves nothing at all.
Decades ago, data mining was considered a sin, akin to plagiarism. If someone presented research that seemed implausible or too good to be true (for example, a near-perfect correlation), a rude retort was "data mining!"
Nowadays, data mining is often considered a virtue, not a vice. In 2008, Chris Anderson, Editor-in-Chief of Wired, wrote an article with the provocative title, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." Anderson argued:
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
A 2015 article in The Economist argued that macroeconomists (who study unemployment, inflation, and the like) should abandon the scientific method and, instead, become data miners:
Macroeconomists are puritans, creating theoretical models before testing them against data. The new breed ignore the white board, chucking numbers together and letting computers spot the patterns.
The Economist is a great magazine, but this was not great journalism. Let's look at several examples of worthless pattern spotting.
Fighting Crime With Facebook
A data-savvy amusement park mined the Facebook accounts of local residents to see if surges and slumps in the use of certain words might help predict park attendance. They identified the 200 most popular words (100 nouns, 50 adjectives, and 50 adverbs) in the English language. Then, for 10 summer weeks, they collected daily data on how often each of these 200 words appeared in Facebook status updates and on park attendance the following day. All the data were scaled to equal 100 at the start of the study, so a value of 101 means 1 percent more than initially, and 99 means 1 percent less.
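The mechanics of such a search are easy to sketch. The Python below is a minimal illustration, not the park's actual procedure: it uses made-up random numbers in place of real word counts and attendance, scales each series to 100 at the start, and then picks whichever of 200 series correlates best with "attendance." The function names, the noise model, and the 70-day sample (10 weeks of daily data) are all assumptions made for illustration. The one-day lag is left out because independent random series have no timing to preserve.

```python
import random

def scale_to_100(series):
    """Rescale so the first observation equals 100 (101 = 1 percent above the start)."""
    base = series[0]
    return [100 * x / base for x in series]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def random_series(rng, n_days):
    """Hypothetical stand-in data: pure noise around 1000, unrelated to anything."""
    return [1000 + rng.gauss(0, 50) for _ in range(n_days)]

def best_random_predictor(n_series=200, n_days=70, seed=0):
    """Search n_series random 'word' series for the one most correlated
    with an equally random 'attendance' series."""
    rng = random.Random(seed)
    attendance = scale_to_100(random_series(rng, n_days))
    best_index, best_r = None, 0.0
    for i in range(n_series):
        word = scale_to_100(random_series(rng, n_days))
        r = pearson(word, attendance)
        if abs(r) > abs(best_r):
            best_index, best_r = i, r
    return best_index, best_r

if __name__ == "__main__":
    idx, r = best_random_predictor()
    print(f"best of 200 random series: #{idx}, correlation {r:+.2f}")
```

With 70 observations, the correlation between two unrelated series has a standard error of roughly 0.12, so the largest of 200 such correlations will usually sit well away from zero even though every series is noise. Scaling to 100 does not change any correlation; it only makes the series easier to compare by eye.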
They found that the two most helpful words for predicting park attendance were day and most, and that the correlation between actual attendance and...