Is Big Data the answer to all our problems?

Noah Smith recently commented on a Malcolm Gladwell talk in which he (Gladwell) expressed doubts about the promise of “Big Data.” When people use the term “Big Data” they are often referring to datasets that are built from digital data streams like Google searches, the Twitter feed, and so forth, but it can also refer simply to large datasets that were previously too cumbersome for researchers to access easily. In any case, Gladwell says that this newfound data availability is not our salvation. He also claims that this data might be a curse. Noah devotes most of his comment to countering Gladwell’s claims. I’m actually not a huge Malcolm Gladwell fan and, like Noah, I basically disagree with the idea that having more data could be a problem for us. More data must be better. (OK, there is the Snowden issue and government invasions of privacy more generally but let’s leave those problems aside – I don’t think that kind of intrusive surveillance is what either Gladwell or Noah has in mind).

There are some aspects of Big Data that I’ve been thinking about for a little while that seem at least somewhat relevant to Gladwell’s argument and Noah’s post. Many researchers – and many economists in particular – see Big Data as a huge benefit to their field. Indeed, some view the arrival of these new datasets as a transformative event in social science. Speaking for myself only, I have some doubts that this new data will be as much of a benefit as many are predicting. In economics, many researchers are being drawn to these datasets without having a direct purpose or plan in mind. To me, this is most concerning with graduate students who are under lots of pressure and sometimes hold out hope that a huge dataset will be like the Holy Grail for an underdeveloped research portfolio. After waiting to obtain their data, the graduate students are typically let down when they realize that the data doesn’t really address the questions they were interested in, or that the data needs to be cleaned and arranged into a usable form, which takes a tremendous amount of work, or that the obvious killer question or killer instrument they were hoping the dataset would present never materializes.

Another thing that pops up all too frequently is the idea that a bigger dataset is automatically superior simply because it has many observations. This is often clearly not the case and it’s painful to see this realization fall upon a researcher (sometimes during their presentations). To take a really obvious example, suppose you are interested in whether extensions of unemployment benefits reduce labor supply by causing people to search longer while they are still getting the payments. A dataset with state-level unemployment rate data over the period 2005-2014 might actually be able to speak to this question. In contrast, a dataset with 100 million daily individual observations for a year isn’t going to help you at all if there is no variation in unemployment benefit policy in that year. Sure, it’s impressive that you can get such a dataset, but it isn’t useful for your research question. Sometimes in seminars, the presenter will intentionally advertise the scope of the dataset in a futile effort to impress the audience. It never works. It’s similar to a related unsuccessful tactic of trying to impress the audience by telling them how long it takes your computer to solve a complicated dynamic programming problem.
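To make the point about policy variation concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the effect size, the noise level, the benefit durations, and the sample sizes are assumptions, not estimates from any real dataset. It simply shows that an ordinary least squares slope needs variation in the policy variable, and that piling on observations does not substitute for it.

```python
import random


def ols_slope(x, y):
    """Return the OLS slope of y on x, or None if x never varies."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # The policy variable is constant, so its effect is not identified.
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(0)   # fixed seed so the sketch is reproducible
TRUE_EFFECT = 0.02       # hypothetical effect of one extra benefit week
NOISE_SD = 0.5           # hypothetical noise in the outcome

# Small "state-year" style sample: only 500 observations, but the
# (hypothetical) benefit duration in weeks differs across observations.
small_x = [rng.choice([26, 39, 52, 73, 99]) for _ in range(500)]
small_y = [TRUE_EFFECT * w + rng.gauss(0, NOISE_SD) for w in small_x]

# Large "individual-level" style sample: 200,000 observations, but the
# benefit duration is the same for everyone in the sample period.
big_x = [26] * 200_000
big_y = [TRUE_EFFECT * w + rng.gauss(0, NOISE_SD) for w in big_x]

small_slope = ols_slope(small_x, small_y)
big_slope = ols_slope(big_x, big_y)

print("small sample slope:", small_slope)  # should land near TRUE_EFFECT
print("large sample slope:", big_slope)    # None: no policy variation
```

The small sample recovers something close to the assumed effect, while the large one cannot produce an estimate at all. A real application would of course need controls, panel structure, and a credible source of variation; the sketch only isolates the role of variation in the regressor.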

Stuff like this comes up all the time. I was in a seminar where a researcher was using individual household-level consumption data to test the permanent income hypothesis (PIH). The dataset was quite nice but the consumption measure combined both durable and nondurable goods and unfortunately the PIH applies only to nondurable consumption spending. When the researcher was asked why they used the individual data rather than aggregate data (which does break out nondurable consumption), the response was simply that they felt that individual data was better than aggregate data (?).

Firm-level data is another pet peeve of mine. I can’t tell you the number of times I’ve heard people say that we should use firm-level data because this is just what people do these days. Firm-level data is particularly noteworthy because one of the classic issues in economics deals with the nature of a firm itself. The straight neoclassical perspective is that the notion of a firm is not particularly well defined. Two mechanic shops that operate independently would appear as two firms in a typical dataset but if the owner of one of the shops sells it to the other owner, these firms would suddenly become a single observation. This problem reminds me of a quote by Frank Zappa: “The most important thing in art is the frame. … without this humble appliance, you can’t know where the art stops and the real world begins.” A similar thing occurs with firm-level data. We have a bunch of underlying behavior and then there are these arbitrary frames placed around groups of activity. We call these arbitrary groupings “firms.” Arbitrary combinations (or breakdowns) like this surely play a large role in dictating the nature of firm-level data. In the end it’s not clear how many real observations we actually have in these datasets.

In the past I’ve been fortunate enough to work with some students who used hand-collected data.[1] Data like this is almost always fairly small in comparison with real-time data or administrative data. Despite this apparent size disadvantage, self-collected data has some advantages that are worth emphasizing. First, the researcher will necessarily know much more about the way the data was collected. Second, the data can be collected with the explicit aim of addressing a specifically targeted research question. Third, building the data from the ground up invites the researcher to confront particular observations that might be noteworthy for one reason or another. In fact, I often encourage graduate students to look in depth at individual observations to build their understanding of the data. This will likely not happen with enormous datasets.

Again, this is not to say that more data is in any way a disadvantage. However, like any input into the research process, the choice of data should be given some thought. A similar thing came up perhaps 15 years ago when more and more powerful computers allowed us to expand the set of models we could analyze. This was greeted as a moment of liberation by some economists but soon the moment of bliss gave way to reality. Adding a couple more state variables wasn’t going to change the field; solving a model a bit faster and more accurately won’t expand our understanding by leaps and bounds. Better? No doubt. A panacea? Not at all.

The real constraints on economics research have always been, and continue to be, a shortage of ideas and creativity. Successfully pushing the boundaries of our understanding requires creative insights coupled with accurate quantitative modelling and good data and empirical work. The kinds of insights I’m talking about won’t be found just lying around in any dataset – no matter how big it is.

[1] One of my favorite examples of such data work is by Ed Knotek who collected a small dataset on prices of goods that were sold in convenience stores but also sold in large supermarkets. See “Convenient Prices and Price Rigidity: Cross-Sectional Evidence,” Review of Economics and Statistics, 2011.

13 thoughts on “Is Big Data the answer to all our problems?”

  1. Pingback: How Big Data Informs Economics | Economics and Development

  2. I’ve always been fascinated with the applications of data science and this has led me to believe that larger data sets are intuitively better than small ones with given manipulation. However you make some good arguments here in the relevance of that data. Beginning research, some hand-collected data sets were too small and consequently skewed statistical analysis. Furthermore, there usually is a huge margin of error. Big data helps circumvent that and so I believe that data set quality doesn’t rest in size as much as relevance as you touched upon.

  3. While I – wholeheartedly – agree with the general sentiment of the post towards big data (one of caution), a perhaps somewhat unfair and trivializing summary of it is: sure, big data in bad hands, like those of a third rate graduate student (sorry, third rate graduate student), is not going to help economics. Sure. Fortunately, the value of a new tool for economics (or any science for that matter) and society is not determined by what third rate graduate students can do with it, but by what the geniuses do with it (that graduate students can get confused by such new things, granted, but then it is our job to help them find the right path again). Big data (btw: I am not sure big data guys mean by that label just larger data sets, but rather the statistical data pattern recognition software on top of it; be that as it may…) will be, if anything, just another tool that can be used well or not so well. Just like mathematics, which produced genius results in the hands of geniuses like Samuelson and Arrow and not so interesting stuff in lesser hands. Computers: it’s what Prescott and some of his students did with them, not last year’s third placed Minnesota graduate student. Structural work in IO: read Pakes and Berry, but the same tool can look ridiculous in lesser hands, etc. It will be the same with big data (or not at all) – btw: the genius here could well be a graduate student right now.

    And now for a few concrete examples of, in my view, exciting “big data” work: the research on inequality and income shocks (Saez, Piketty, Guvenen), Mian and Sufi’s work on consumption and employment responses to house price declines, a bit older: Davis and Haltiwanger’s work on the labor market that truly changed our way of thinking about it – are you really going to dismiss those as uninteresting? I just don’t see how inequality research is possible without large administrative data sets. And the insights on pricing behavior we have accumulated in the past ten years: we are basing one of the largest and most intrusive government interventions into the economy essentially on the assumption that prices are sticky – shouldn’t we want to know what that actually means?

  4. Great article. There is a lot of discussion on the subject where I work. I am a fan of “big data”. The problem where I work, the public sector, is that people have no idea how to use it. So the question might not be “big data” or not but rather, do I need it and how do I use it in this specific matter.

  5. Hi Chris,
    thanks for this great article. You are totally right that big data is not automatically better than small data. I worked, for example, as a business consultant on a project to implement a huge BI system at a German car manufacturer. They desired this system so much because they wanted to analyse their data, but they had absolutely no idea what questions should be answered by the data. It was so strange. Big Data could be a huge improvement if the data and the questions are good.
    Best regards
    Christoph

  6. Pingback: Is Big Data the answer to all our problems? | GEEKBEANS

  7. Very interesting. I agree, and I think most people will, that just raw and massive amounts of data do not help anyone; as you stated, data has to be properly arranged in order to provide relevant answers. Also, I fail to see how more (relevant) data could possibly be negative in any way: the more relevant information, the better the decision making ability, or at least that is my opinion.

    Feel free to check out my Blog in whatonomics.com
