Is Big Data the answer to all our problems?

Noah Smith recently commented on a Malcolm Gladwell talk in which he (Gladwell) expressed doubts about the promise of “Big Data.” When people use the term “Big Data” they are often referring to datasets built from digital data streams like Google searches, the Twitter feed, and so forth, but it can also refer simply to large datasets that were previously too cumbersome for researchers to access easily. In any case, Gladwell says that this newfound data availability is not our salvation; he even claims that it might be a curse. Noah devotes most of his comment to countering Gladwell’s claims. I’m actually not a huge Malcolm Gladwell fan and, like Noah, I basically disagree with the idea that having more data could be a problem for us. More data must be better. (OK, there is the Snowden issue and government invasions of privacy more generally, but let’s leave those problems aside – I don’t think that kind of intrusive surveillance is what either Gladwell or Noah has in mind.)

There are some aspects of Big Data that I’ve been thinking about for a little while that seem at least somewhat relevant to Gladwell’s argument and Noah’s post. Many researchers – and many economists in particular – see Big Data as a huge benefit to their field. Indeed, some view the arrival of these new datasets as a transformative event in social science. Speaking for myself only, I have some doubts that this new data will be as much of a benefit as many are predicting. In economics, many researchers are being drawn to these datasets without a direct purpose or plan in mind. To me, this is most concerning with graduate students, who are under lots of pressure and sometimes hold out hope that a huge dataset will be the Holy Grail for an underdeveloped research portfolio. After waiting to obtain the data, these students are typically let down when they realize that it doesn’t really address the questions they were interested in, that cleaning and arranging it into a usable form takes a tremendous amount of work, or that the killer question or killer instrument they hoped the dataset would present simply isn’t there.

Another idea that pops up all too frequently is that a bigger dataset is automatically superior simply because it has many observations. This is often clearly not the case, and it’s painful to watch this realization fall upon a researcher (sometimes during their presentation). To take a really obvious example, suppose you are interested in whether extensions of unemployment benefits reduce labor supply by causing people to search less while they are still receiving the payments. A dataset with state-level unemployment rate data over the period 2005-2014 might actually be able to speak to this question, since benefit policies varied across states and over time during that period. In contrast, a dataset with 100 million daily individual observations for a single year isn’t going to help you at all if there is no variation in unemployment benefit policy in that year. Sure, it’s impressive that you can get such a dataset, but it isn’t useful for your research question. Sometimes in seminars, the presenter will advertise the scope of the dataset in a futile effort to impress the audience. It never works. It’s similar to the related unsuccessful tactic of trying to impress the audience by telling them how long it takes your computer to solve a complicated dynamic programming problem.

Stuff like this comes up all the time. I was in a seminar where a researcher was using individual household-level consumption data to test the permanent income hypothesis (PIH). The dataset was quite nice, but the consumption measure combined both durable and nondurable goods, and unfortunately the PIH applies only to nondurable consumption spending. When the researcher was asked why he or she used the individual data rather than aggregate data (which does break out nondurable consumption), the response was simply a feeling that individual data is better than aggregate data (?).

Firm-level data is another pet peeve of mine. I can’t tell you the number of times I’ve heard people say that we should use firm-level data simply because that’s just what people do these days. Firm-level data is particularly problematic because one of the classic issues in economics concerns the nature of the firm itself. The straight neoclassical perspective is that the notion of a firm is not particularly well defined. Two mechanic shops that operate independently would appear as two firms in a typical dataset, but if the owner of one shop sells it to the other owner, the two firms suddenly become a single observation. This problem reminds me of a quote by Frank Zappa: “The most important thing in art is the frame. … without this humble appliance, you can’t know where the art stops and the real world begins.” A similar thing occurs with firm-level data. We have a bunch of underlying behavior, and then there are these arbitrary frames placed around groups of activity. We call these arbitrary groupings “firms.” Arbitrary combinations (or breakdowns) like this surely play a large role in dictating the nature of firm-level data. In the end it’s not clear how many real observations we actually have in these datasets.

In the past I’ve been fortunate enough to work with some students who used hand-collected data.[1] Data like this is almost always fairly small in comparison with real-time data or administrative data. Despite this apparent size disadvantage, self-collected data has some advantages that are worth emphasizing. First, the researcher will necessarily know much more about how the data was collected. Second, the data can be collected with the explicit aim of addressing a specifically targeted research question. Third, building the data from the ground up invites the researcher to confront particular observations that might be noteworthy for one reason or another. In fact, I often encourage graduate students to look in depth at individual observations to build their understanding of the data. This is unlikely to happen with enormous datasets.

Again, this is not to say that more data is in any way a disadvantage. However, like any input into the research process, the choice of data should be given some thought. A similar thing happened perhaps 15 years ago when more and more powerful computers expanded the set of models we could analyze. Some economists greeted this as a moment of liberation, but the bliss soon gave way to reality. Adding a couple more state variables wasn’t going to change the field; solving a model a bit faster and more accurately wasn’t going to expand our understanding by leaps and bounds. Better? No doubt. A panacea? Not at all.

The real constraints on economics research have always been, and continue to be, a shortage of ideas and creativity. Successfully pushing the boundaries of our understanding requires creative insights coupled with accurate quantitative modeling and good data and empirical work. The kinds of insights I’m talking about won’t be found just lying around in any dataset – no matter how big it is.

[1] One of my favorite examples of such data work is by Ed Knotek who collected a small dataset on prices of goods that were sold in convenience stores but also sold in large supermarkets. See “Convenient Prices and Price Rigidity: Cross-Sectional Evidence,” Review of Economics and Statistics, 2011.