Big Data Requires a New Kind of Expert: The Econinformatrician

Simple Models + Big Data = Econinformatics

Economic Theory + Big Data = Econinformatics

I recently had the opportunity to attend a conference held in honor of the great econometrician Dr. Jerry Hausman. The event, hosted by the Wang Yanan Institute for Studies in Economics (WISE) at Xiamen University, focused on recent developments in econometric theory with applications. One notable presentation was by Google’s Dr. Randall Lewis.

In his presentation, Dr. Lewis gave a practitioner’s overview of what it really means to work with Big Data. Because he works at Google, the audience expected him to be uniquely qualified to discuss the day-to-day troubles that come with analyzing petabytes of data. Instead, he started with a story about his first day as a Yahoo intern in 2008.

He had a problem. He needed to open a 2GB text file and had no idea how. He tried Notepad — it didn’t work. He tried importing it to Matlab — his computer couldn’t handle it. He exhausted every method he had used in the past to view or load data. All failed, and several days passed before he finally managed to open the file.

Like most economics Ph.D.s, Dr. Lewis’ education focused almost entirely on theory, but unlike most economics Ph.D.s, his internship gave him the opportunity to battle against large text files. He learned simple technical skills he wasn’t learning in his doctoral program, which ultimately allowed him to do his dissertation using data inaccessible to his colleagues.

He later told me that he wishes more Ph.D. students could have the same opportunities that he did. He was lucky and benefitted greatly from the practical parts of his education — they helped him finish his Ph.D. in just four years.

Econometric theory is inarguably important, as one cannot do proper causal inference without it. And as Dr. Lewis explained, causal inference is what separates economists from “data scientists”. Unfortunately for the economist, econometric theory doesn’t explain how to open a large text file from the Unix command line.

Working with Big Data is cumbersome. Simple tasks, like opening or loading files, become complicated. Dr. Lewis explained that computer science and engineering students may finish school with the skills to deal with large data files, but many (if not most) students of economics do not. The simple models they learned to run as young econometricians also stop being so simple when performed on Big Data. “I just have to highlight that, in almost everything I do, it’s actually embarrassingly trivial, econometrically,” said Dr. Lewis during his talk. “I’m trying to work towards doing more advanced things, but you end up running into scalability constraints.”

Constraints are why the simple becomes difficult when doing economic analyses with Big Data. Hardware, computational power, time, funds, scalability, knowledge, etc. are all constrained, posing major challenges to the econometrician. When running a basic linear regression can cost you tens of thousands of dollars in electricity consumption, attempting a more computationally complex model just isn’t feasible.

If Big Data is the future of applied econometrics, then a strong background in econometric theory, while necessary, will no longer be sufficient for young econometricians looking to find work. They will also require the technical know-how to deal with terabytes of data. This combination is rare enough that Dr. Lewis has coined a term for what he does: Econinformatics.

Half computer scientist, half economist. The Econinformatrician.



One final note. I wanted to highlight an interaction that occurred during Dr. Lewis’s presentation:

Dr. Lewis to the audience — “Who here, if I were to give you a 200GB gzipped file, could tell me how to read the first 3 lines of that file?”

One person (yours truly) raised a hand. There were roughly 110 people in the audience.

A professor from the audience — “With help from my RA, yes.”

[Audience laughs]

I’m sure Dr. Hausman will be fine without knowing how to read the first 3 lines of a 200 GB file. Personally, once I got back from China, I immediately began teaching myself more Unix command line and some SQL.
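For readers curious about the answer to Dr. Lewis’s question, here is a minimal sketch in Python. It is one possible approach, not necessarily the one he had in mind. It relies on the fact that a gzip stream can be decompressed incrementally, so only the beginning of the file is ever read. The function name head_gzip and the file name in the usage note are hypothetical, chosen for illustration.

```python
import gzip
from itertools import islice


def head_gzip(path, n=3, encoding="utf-8"):
    """Return the first n lines of a gzipped text file without reading the rest.

    gzip.open decompresses lazily as the file is iterated, so memory use
    does not depend on the total size of the file.
    """
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        # islice stops after n lines; the remainder of the file is never touched.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling head_gzip("purchases.txt.gz") on a (hypothetical) file returns its first three lines as a list of strings. On a Unix command line, piping `gzip -dc` into `head -n 3` does the same job.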



Big Data, Nutrition, and Food Taxes

A recent Huffington Post article discusses one of my papers with Michael Lovenheim at Cornell University. In this paper we analyze a series of unique Big Data sources tracking over 123 million food purchases over the period 2002-2007 in order to create a detailed model of food demand in the US. Understanding food demand is important because obesity is a major public health concern worldwide. Obesity kills more than 2.8 million people every year, according to the WHO. Today, over two-thirds of Americans are overweight and over 36% are obese. Estimates also suggest that about 30% of children are obese or overweight. The increases in obesity have been more pronounced among those with lower income, especially for women, as well as among non-Asian minorities. Obesity has been linked to a higher prevalence of chronic diseases, such as arthritis, diabetes, and cardiovascular disease, and the associated cost to the U.S. medical system has been estimated at about $147 billion per year, with Medicare and Medicaid financing approximately half of these costs.

Using Big Data, we simulate the role of product taxes on soda, sugar-sweetened beverages, packaged meals, and snacks, and nutrient taxes on fat, salt, and sugar. We find that nutrient taxes (e.g. on sugar) have a significantly larger impact on nutrition than an equivalent product tax (e.g. a soda tax), because nutrient taxes are broader-based taxes. However, the costs of these taxes in terms of consumer utility are not higher. A sugar tax in particular is a powerful tool to induce healthier nutritive bundles among consumers, and appears to be more effective than other product or nutrient taxes.

Harding, M. and Lovenheim, M. (2013) “The Effect of Prices on Nutrition: Comparing the Impact of Product- and Nutrient-Specific Taxes”, NBER WP 19781.

Graphical representation of food expenditures on the 14 major product categories for US households in the sample.

The size of the squares is proportional to the budget share of the corresponding product. The budget share is given in % under each product category name. The color shading of each rectangle corresponds to the price per ounce of products in each of the categories. The price per ounce in $ is also reported under the budget share for each category.


Statistical significance in Big Data

Critical Values and Sample Sizes

An interesting problem when analyzing Big Data is whether one should report the statistical significance of the estimated coefficients at the 1% level, instead of the more conventional 5% level. Intuitively a more conservative approach seems reasonable, but how do we decide exactly how conservative we ought to be?

It has been recognized for some time that with large data sets it becomes “too easy” to reject the null hypothesis that a coefficient is zero, since the width of confidence intervals shrinks at rate O(N^{-1/2}) (Granger, 1998). The problem with a standard t-test in large samples is that it is replaced by its asymptotic form, with critical values drawn from the Normal distribution. As a result, the critical value for testing at the 5% significance level (95% confidence) stays fixed at about 1.96 no matter how large the sample is. One possibility for addressing this problem is to let the critical value be a function of the sample size.

My colleague Carlos Lamarche, at the University of Kentucky, pointed out this week that one can think about this as a testing problem for nested models. Cameron and Trivedi (2005) suggest using the Bayesian Information Criterion (BIC), for which the penalty increases with the sample size. Using the BIC to test the significance of one variable is equivalent to using a two-sided t-test with a critical value of \sqrt{\ln(N)}.
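To see where the \sqrt{\ln(N)} threshold comes from, here is a short sketch of the argument. It assumes the common form of the BIC and compares two nested models that differ by a single coefficient. The symbols L_0, L_1, and k are notation introduced here for illustration: model 0 has k parameters and maximized likelihood L_0, while model 1 adds one variable and has maximized likelihood L_1.

```latex
\begin{align*}
\mathrm{BIC}_0 &= -2\ln L_0 + k\ln N, \qquad
\mathrm{BIC}_1 = -2\ln L_1 + (k+1)\ln N \\[4pt]
\mathrm{BIC}_1 < \mathrm{BIC}_0
  &\iff 2\left(\ln L_1 - \ln L_0\right) > \ln N \\
  &\iff t^2 > \ln N
     \quad \text{(asymptotically, since the LR statistic for one restriction behaves like } t^2\text{)} \\
  &\iff |t| > \sqrt{\ln N}
\end{align*}
```

In other words, the extra variable is retained only when its t-statistic clears a bar that rises slowly with the sample size, rather than a fixed value such as 1.96.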

The plot shows how the critical value increases with the scale of the data and how this compares with the standard critical values for the t-test at different levels of significance. The BIC suggests critical values greater than 2 for sample sizes larger than 1,000. For Big Data with over 1M observations, a critical value equivalent to a t-test at the 99% or even 99.9% confidence level seems advisable.
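The comparison can be reproduced numerically. The following is a minimal sketch, assuming the \sqrt{\ln(N)} rule stated above and two-sided critical values from the standard Normal distribution. The function names are my own, chosen for illustration, and only the Python standard library is used.

```python
import math
from statistics import NormalDist


def bic_critical_value(n):
    """Critical value sqrt(ln N) implied by the BIC for one added variable."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    return math.sqrt(math.log(n))


def normal_critical_value(alpha):
    """Two-sided standard Normal critical value at significance level alpha."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def implied_significance(n):
    """Two-sided Normal tail probability at the BIC-implied critical value."""
    c = bic_critical_value(n)
    return 2.0 * (1.0 - NormalDist().cdf(c))


if __name__ == "__main__":
    # Fixed critical values that do not depend on the sample size.
    for alpha in (0.05, 0.01, 0.001):
        print(f"alpha = {alpha}: critical value {normal_critical_value(alpha):.3f}")
    # BIC-implied critical values that grow with the sample size.
    for n in (10**3, 10**4, 10**6, 10**9):
        print(
            f"N = {n}: critical value {bic_critical_value(n):.3f}, "
            f"implied significance level {implied_significance(n):.6f}"
        )
```

With one million observations, the BIC-implied threshold exceeds the conventional two-sided critical value at the 99.9% confidence level, which is consistent with the recommendation above.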

Granger, C. W. J. (1998): “Extracting information from mega-panels and high-frequency data,” Statistica Neerlandica, 52(3), 258–272.

Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Applications. New York, NY: Cambridge University Press.


Big Data and Small Children

In a recent paper, Janet Currie, a professor at Princeton University, asks whether privacy concerns may be responsible for missed opportunities in pediatrics research. Economists have long emphasized the importance of early childhood interventions that can lead to significant lifetime benefits. Big Data research on children is not easy to conduct because detailed administrative records are often unavailable to researchers. For example, detailed birth certificates with precise information on infants and mothers are collected but generally not made available to qualified researchers. The trend in recent years has been to eliminate the access that was previously granted in some states, such as Texas.

While there are valid privacy questions and interesting ethical questions related to consent, I agree that it is a missed opportunity. Since much of this data is only available in aggregate form for large counties, it is difficult to draw causal inferences about the drivers of infant and child outcomes. Data from third-party aggregators such as Acxiom is fairly accurate at the household level but offers limited insights about the children in the household. This impedes much valuable public health research.

In a recent SIEPR policy brief, I look at the impact of drinking water contaminants on infant health using county-level data and find significant impacts on birth weight and APGAR scores for a number of contaminants. Given the geographic spread of contaminated drinking water, it would be very insightful to do this analysis using the precise geographic location of each mother in order to measure her exposure. This is not possible today, and we are missing out on an opportunity to precisely quantify the social cost of not enforcing environmental regulations.

Acute total coliform contamination 2002-2006

Theory vs Measurement

Francis Diebold, Professor of Economics and Statistics at the University of Pennsylvania, challenges us in his 60-second lecture to consider that: “[…] theory gets too much respect, and measurement doesn’t get enough. Perhaps that will change in the emerging age of Big Data.”

Will we succeed in using Economics to transform Big Data Analytics, so it becomes less of a data mining technique and more of an Econometric tool?