Big Data, Nutrition, and Food Taxes

A recent Huffington Post article discusses one of my papers with Michael Lovenheim at Cornell University. In this paper we analyze a series of unique Big Data sources tracking over 123 million food purchases over the period 2002-2007 in order to build a detailed model of food demand in the US.

Understanding food demand is important because obesity is a major public health concern worldwide. According to the WHO, obesity kills more than 2.8 million people every year. Today, over two-thirds of Americans are overweight and over 36% are obese. Estimates also suggest that about 30% of children are overweight or obese. The increases in obesity have been more pronounced among those with lower incomes, especially women, as well as among non-Asian minorities. Obesity has been linked to a higher prevalence of chronic diseases such as arthritis, diabetes, and cardiovascular disease, and the associated cost to the U.S. medical system has been estimated at about $147 billion per year, with Medicare and Medicaid financing approximately half of these costs.

Using Big Data we simulate product taxes on soda, sugar-sweetened beverages, packaged meals, and snacks, as well as nutrient taxes on fat, salt, and sugar. We find that a nutrient tax (e.g. on sugar) has a significantly larger impact on nutrition than an equivalent product tax (e.g. a soda tax), because nutrient taxes are broader-based: they raise the price of every product containing the taxed nutrient rather than of a single category. The costs of these taxes in terms of consumer utility are, however, not higher. A sugar tax in particular is a powerful tool for inducing healthier nutritional bundles among consumers, and appears to be more effective than the other product or nutrient taxes we consider.
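
To see why the broader base matters, consider a deliberately toy calculation. All numbers below (budget shares, own-price elasticities, sugar intensities) are made up for illustration, and this is not the demand system estimated in the paper; it simply compares a soda-only tax with a sugar tax scaled so that soda's price rises by the same 10% in both cases.

```python
# Toy calculation with hypothetical numbers (NOT the paper's estimates):
# each category has a budget share, an own-price elasticity, and a
# sugar intensity in grams per dollar spent; cross-price effects ignored.
categories = {
    "soda":           (0.05, -1.2, 30.0),
    "snacks":         (0.08, -0.9, 15.0),
    "packaged meals": (0.10, -0.7,  5.0),
    "other food":     (0.77, -0.5,  2.0),
}

def sugar_change(price_increase):
    """Percent change in total sugar purchased, given a dict of
    proportional price increases by category."""
    base = sum(share * sugar for share, _, sugar in categories.values())
    new = sum(share * sugar * (1 + elast * price_increase.get(name, 0.0))
              for name, (share, elast, sugar) in categories.items())
    return 100 * (new - base) / base

# Product tax: 10% price increase on soda only.
soda_tax = {"soda": 0.10}

# Nutrient tax: each category's price rises in proportion to its sugar
# intensity, normalized so soda's price again rises by 10%.
soda_sugar = categories["soda"][2]
sugar_tax = {name: 0.10 * sugar / soda_sugar
             for name, (_, _, sugar) in categories.items()}

print(f"soda-only tax: {sugar_change(soda_tax):+.1f}% total sugar")
print(f"sugar tax:     {sugar_change(sugar_tax):+.1f}% total sugar")
```

With these made-up numbers the soda-only tax cuts total sugar by about 3.8% while the equivalent sugar tax cuts it by about 5.2%, because the latter also shifts purchases away from snacks and packaged meals.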

Harding, M. and Lovenheim, M. (2013) “The Effect of Prices on Nutrition: Comparing the Impact of Product- and Nutrient-Specific Taxes”, NBER WP 19781.

Graphical representation of food expenditures on the 14 major product categories for US households.

The size of each rectangle is proportional to the budget share of the corresponding product category, and the budget share is given in % under each category name. The color shading of each rectangle corresponds to the price per ounce of products in that category; the price per ounce in $ is also reported under the budget share.
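
A figure of this type can be reproduced with a treemap. Here is a minimal sketch using matplotlib and the squarify package, with four made-up categories (and made-up shares and prices) standing in for the 14 in the sample:

```python
import matplotlib.pyplot as plt
from matplotlib import cm, colors as mcolors
import squarify  # pip install squarify

# Four made-up categories standing in for the 14 in the sample.
names = ["soda", "snacks", "packaged meals", "dairy"]
shares = [5.2, 8.1, 10.4, 12.3]    # budget shares in %
prices = [0.03, 0.12, 0.09, 0.06]  # price per ounce in $

# Shade each rectangle by its price per ounce.
norm = mcolors.Normalize(vmin=min(prices), vmax=max(prices))
rect_colors = [cm.viridis(norm(p)) for p in prices]
labels = [f"{n}\n{s:.1f}%\n${p:.2f}/oz"
          for n, s, p in zip(names, shares, prices)]

# Rectangle areas proportional to budget shares.
squarify.plot(sizes=shares, label=labels, color=rect_colors, pad=True)
plt.axis("off")
plt.show()
```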


Statistical significance in Big Data

Critical Values and Sample Sizes

An interesting problem when analyzing Big Data is whether one should report the statistical significance of the estimated coefficients at the 1% level instead of the more conventional 5% level. Intuitively, a more conservative approach seems reasonable, but how do we decide exactly how conservative we ought to be?

It has been recognized for some time that with large datasets it becomes “too easy” to reject the null hypothesis that a coefficient is zero, since confidence intervals shrink at rate O(N^{-1/2}) (Granger, 1998). The problem is that in large samples the standard t-test takes its asymptotic form, with critical values drawn from the Normal distribution. As a result, the critical value for a test at the 5% significance level does not grow with the sample size, even as tiny effects become very precisely estimated. One possibility for addressing this problem is to let the critical value be a function of the sample size.

My colleague Carlos Lamarche, at the University of Kentucky, pointed out this week that one can think about this as a testing problem for nested models. Cameron and Trivedi (2005) suggest using the Bayesian Information Criterion (BIC), for which the penalty increases with the sample size. For two nested models differing by a single variable, the BIC selects the larger model when the likelihood-ratio statistic exceeds ln(N); since the likelihood-ratio statistic is approximately the square of the t-statistic, using the BIC to test the significance of one variable is equivalent to a two-sided t-test with critical value \sqrt{\ln(N)}.

The plot shows how this critical value increases with the scale of the data and how it compares with the standard critical values for the t-test at different significance levels. The BIC rule implies critical values well above 2 for sample sizes larger than 1,000 (at N = 1,000, \sqrt{\ln(N)} ≈ 2.63, already above the conventional 1% critical value of 2.58). When using Big Data with over 1M observations, a critical value equivalent to a t-test at the 99% or even the 99.9% confidence level seems advisable.
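
As a quick check, here is a minimal sketch (in Python, using scipy; the sample sizes are arbitrary) that tabulates \sqrt{\ln(N)} against the standard two-sided Normal critical values, along the lines of the plot:

```python
import numpy as np
from scipy.stats import norm

# BIC-implied critical value sqrt(ln N) versus standard two-sided
# Normal critical values at conventional significance levels.
levels = {"5%": 0.05, "1%": 0.01, "0.1%": 0.001}
std_cvs = {name: norm.ppf(1 - a / 2) for name, a in levels.items()}

print(f"{'N':>12} {'sqrt(ln N)':>11}"
      + "".join(f"{name:>8}" for name in levels))
for n in [10**k for k in range(2, 9)]:  # 100 up to 100 million
    bic_cv = np.sqrt(np.log(n))
    print(f"{n:>12,} {bic_cv:>11.3f}"
          + "".join(f"{cv:>8.3f}" for cv in std_cvs.values()))
```

At N = 1,000,000 the BIC rule gives \sqrt{\ln(N)} ≈ 3.72, above even the two-sided 0.1% critical value of about 3.29.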

Granger, C. W. J. (1998): “Extracting information from mega-panels and high-frequency data,” Statistica Neerlandica, 52(3), 258–272.

Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Applications. New York, NY: Cambridge University Press.


Big Data and Small Children

In a recent paper, Janet Currie, a Professor at Princeton University, asks whether privacy concerns are responsible for missed opportunities in pediatrics research. Economists have long emphasized the importance of early childhood interventions, which can lead to significant lifetime benefits. Yet Big Data research on children is not easy to conduct, because detailed administrative records are often unavailable to researchers. For example, detailed birth certificates with precise information on infants and mothers are collected but generally not made available to qualified researchers, and the trend in recent years has been to eliminate the access previously granted in some states, such as Texas. While there are valid privacy questions and interesting ethical questions related to consent, I agree that this is a missed opportunity.

Since much of this data is available only in aggregate form for large counties, it is difficult to draw causal inferences about the drivers of infant and child outcomes. Data from third-party aggregators such as Acxiom is fairly accurate at the household level, but offers limited insight into the children in the household. This impedes much valuable public health research. In a recent SIEPR policy brief I look at the impact of drinking water contaminants on infant health using county-level data and find significant impacts on birth weight and APGAR scores for a number of contaminants. Given the geographic spread of contaminated drinking water, it would be very insightful to repeat this analysis using the precise geographic location of each mother in order to measure her exposure. This is not possible today, and we are missing out on an opportunity to precisely quantify the social cost of not enforcing environmental regulations.
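
For concreteness, the county-level design looks roughly like the following sketch: birth weight regressed on a contamination indicator with county and year fixed effects and county-clustered standard errors. The variable names and input file are hypothetical stand-ins, not the actual SIEPR data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical county-year panel: average birth weight and an
# indicator for an acute total coliform violation in that county-year.
df = pd.read_csv("county_panel.csv")  # hypothetical input file

# Birth weight on contamination with county and year fixed effects;
# standard errors clustered at the county level.
model = smf.ols(
    "birth_weight ~ coliform_violation + C(county) + C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["county"]})

print(f"effect on birth weight: {model.params['coliform_violation']:.1f} g "
      f"(s.e. {model.bse['coliform_violation']:.1f})")
```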

Acute total coliform contamination, 2002-2006.

Theory vs Measurement

Francis Diebold, Professor of Economics and Statistics at the University of Pennsylvania, challenges us in his 60-second lecture to consider that: “[…] theory gets too much respect, and measurement doesn’t get enough. Perhaps that will change in the emerging age of Big Data.”

Will we succeed in using Economics to transform Big Data Analytics, so it becomes less of a data mining technique and more of an Econometric tool?

[Video: Francis Diebold’s 60-second lecture]