A Chat with Randall Lewis (Google) About “Big Data”

I had an opportunity to chat with Dr. Lewis about his research and Google. I did not record the interview, so this is not a standard “Q and A”.

The 5 questions I asked Dr. Lewis were broadly categorized into 3 groups: 2 questions about his research, 2 questions about working at Google, and 1 general question. I’ve done my best to paraphrase Dr. Lewis.


Measuring the Effects of Advertising

1. Getting more data can be both a blessing and a curse. Often, although one may add more of the “signal” we are interested in measuring, one also adds more “noise.” In your studies of online advertising, how does this added noise affect your ability to use traditional statistical tools, such as the “5% significance level”?

The challenge of online advertising is that the effect of advertising exists and is typically both economically large and statistically tiny. A model for a well-run experiment on a profitable ad campaign may yield a partial R-squared value of about 0.000005 — what most economists would consider to be zero. Such a tiny R-squared implies that one needs a massive sample size (2+ million unique users) to obtain statistically significant estimates.
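To get a feel for the arithmetic, here is a back-of-the-envelope power calculation (my own sketch, not Dr. Lewis’s) showing how a partial R-squared of roughly 0.000005 translates into a sample requirement in the millions. The 5% significance level and 80% power are conventional choices, not figures from the interview.

    # Back-of-the-envelope power calculation: how many users does it take to
    # detect an effect whose partial R-squared is about 0.000005?
    # (Illustrative sketch only; the 5% / 80% settings are conventional defaults.)
    from scipy.stats import norm

    partial_r2 = 5e-6                    # partial R-squared of the ad effect
    alpha, power = 0.05, 0.80            # test size and power

    f2 = partial_r2 / (1 - partial_r2)              # Cohen's f-squared effect size
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # roughly 1.96 + 0.84
    n = z ** 2 / f2                                 # required sample size

    print(f"Required sample: ~{n:,.0f} users")      # roughly 1.6 million

At 90% power, or with an even smaller R-squared, the requirement climbs past two million users, consistent with the “2+ million unique users” figure above.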

This is where firms like Yahoo and Google, whose ad platforms reach enormous numbers of users, have a clear advantage of scale. They can run massive experiments involving tens or hundreds of millions of users. But this leads to another, bigger problem: the Data Generating Process (DGP).

Collecting data is no longer very difficult or expensive for firms. Quantity is not a problem. But quality — how the data is generated, i.e. the DGP — remains a persistent problem.

This is captured best in the classic phrase “garbage in, garbage out.” There is no shortage of garbage data, and garbage data is growing faster than quality data. Garbage data is easy to generate. Quality data is not. Quality data — data that is free of bias — requires randomization and meticulous care and maintenance of the DGP.

When searching for a tiny effect, as in Dr. Lewis’s online advertising research, understanding the DGP is paramount. Any bias that cannot be properly controlled in your model can devastate your estimates, especially in settings where the treatment (advertising) and the outcome (purchases) are observed not only by the researcher but also by other systems (ad servers and other optimization algorithms). In these circumstances, worst-case bias scenarios (e.g., the users most likely to buy a product are the ones shown its ads) are not uncommon.

Ideally, researchers want control over the entire DGP because this facilitates alignment between the DGP and the model. There is no room for a brute force approach when trying to find such a faint signal in so much noise. In the particular case of online advertising, the model must be perfectly aligned to fit the DGP or the estimates will be inconsistent.

A perfect model is fragile and inflexible. Dr. Lewis said that one should think of models in the online advertising world like particle accelerators. A particle accelerator is designed precisely to detect a very specific signal in a carefully generated set of experimental data. Bias leaking into the DGP via misaligned systems is akin to an earthquake knocking a particle accelerator out of alignment. Both are too finely tuned to handle such disturbances — and in both systems, the data will show it.

According to Dr. Lewis, one begins to appreciate the difficulty and challenge posed by “Big Data” problems like online advertising when one attempts to design a model that is 99% correctly aligned with the DGP. This is a dauntingly difficult task. And even if one succeeds, the results can be hard to believe.

For example, that online advertising may, in certain cases, have little effect on consumer behavior and may not be worth the money. Take the study by economists at eBay, who found that biased analyses likely cost eBay well over $100M in ineffective advertising expenditures that had been erroneously pitched as eBay’s most effective ad spending.

2. From your paper, it would appear folks in the advertising world do not share the obsession economists have with bias. For data folks out there unfamiliar with bias, why is it such a big problem and why can’t throwing more data in the mix solve the problem?

Selection bias is an ever-present problem for economists. Targeted advertising, for example, is an industry standard. However, measuring the success of an ad campaign is nearly impossible, no matter how much data you have, without randomization. And most online advertising firms are not randomly targeting consumers (that would defeat the “target” in “targeted” advertising). What we have, then, is an industry dependent on biased data; those who see ads are fundamentally different from those who do not.

Perhaps paying for targeted ads is a good strategy during an ad campaign. Perhaps the targeted ads are working. But unless a targeted ad campaign includes random assignment, the data it produces will, by construction, be plagued by selection bias. This, in turn, makes it impossible to do any sort of causal inference about the effectiveness of the campaign. And yet, that is exactly what many advertising firms tend to do — make unsubstantiated causal claims.
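A minimal simulation (my own sketch, not something from the interview) makes the point concrete: when exposure is targeted at high-intent users, the naive exposed-versus-unexposed comparison stays biased no matter how large the sample gets, while a randomized comparison converges to the true (tiny) lift. All the numbers below are invented for illustration.

    # Illustrative sketch: selection bias does not shrink with more data,
    # but randomization removes it. Every parameter here is made up.
    import numpy as np

    rng = np.random.default_rng(0)
    true_lift = 0.001                                # tiny true effect of seeing the ad

    for n in (10_000, 1_000_000, 10_000_000):
        intent = rng.uniform(size=n)                 # latent purchase intent
        # Targeted campaign: high-intent users are much more likely to see the ad.
        targeted = rng.uniform(size=n) < 0.1 + 0.8 * intent
        # Randomized experiment: exposure is a coin flip, independent of intent.
        randomized = rng.uniform(size=n) < 0.5

        y_targeted = rng.uniform(size=n) < 0.02 + 0.05 * intent + true_lift * targeted
        y_random = rng.uniform(size=n) < 0.02 + 0.05 * intent + true_lift * randomized

        naive = y_targeted[targeted].mean() - y_targeted[~targeted].mean()
        experimental = y_random[randomized].mean() - y_random[~randomized].mean()
        print(f"n={n:>10,}  targeted: {naive:+.4f}  randomized: {experimental:+.4f}")

The targeted estimate settles around a value many times larger than the true lift, and adding data only makes that wrong number more precise; only the randomized comparison converges to something close to +0.001.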


At Google

3. For economists, a ‘big’ data set can be a few gigabytes in size. I’m assuming this is laughably small for someone at Google. What’s it like to do data analysis at Google, and what tools/techniques do you use for handling and making sense of such truly massive amounts of data?

Dr. Lewis summed up working with “Big Data” at Google succinctly:

“Big Data in practice is just glorified computational accounting.”

Data is generally collected for some basic business tabulation to settle accounts. For example, large advertising companies collect data on clicks and ad impressions primarily for billing purposes.

Additional “Big Data” applications have been built on this primary business case for revenue optimization. “Big Data” is now a race to leverage the new granularity of such accounting data. Naturally, these new applications can influence what data is recorded thanks to the efficiency of computational tools.

“Big Data” is a storage challenge. Day to day, if any data is needed, it is never downloaded raw; it would be too big. Enormous data sets are instead refined into much smaller ones before extraction, using tools like SQL or Hadoop.
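To illustrate the “refine before you extract” workflow, here is a hypothetical sketch: raw event logs are aggregated down to a small per-campaign summary inside the database, and only that summary is pulled out for analysis. The sqlite3 database and all table and column names below are stand-ins I made up, not Google’s actual stack.

    # Hypothetical sketch of refining data before extraction: aggregate raw
    # event logs into a small per-campaign, per-day summary table.
    # (sqlite3 and every name here are stand-ins, not a real ad system.)
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_impressions (campaign_id, event_time, clicked, spend_micros)")
    conn.executemany(
        "INSERT INTO raw_impressions VALUES (?, ?, ?, ?)",
        [(1, "2013-06-01 09:00:00", 0, 1200),
         (1, "2013-06-01 09:05:00", 1, 1500),
         (2, "2013-06-02 10:00:00", 0, 900)],
    )

    summary = conn.execute("""
        SELECT campaign_id,
               DATE(event_time)        AS day,
               COUNT(*)                AS impressions,
               SUM(clicked)            AS clicks,
               SUM(spend_micros) / 1e6 AS spend_dollars
        FROM   raw_impressions
        GROUP  BY campaign_id, day
    """).fetchall()

    print(summary)   # a handful of rows, small enough to analyze locally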

As for doing analysis, there are loads of tools to choose from but it is best to learn analytical software that facilitates sharing and collaboration. Don’t be the only person using R when your team is using Python.

4. (Matt’s question) What skills and educational background does it take to get a job doing data analysis at Google?

In rough order of importance, Dr. Lewis suggested the following programming skills:

Engineering — C/C++, Java, Python, Go, SQL.
Quantitative Analysts (Data Scientists) — Python (SciPy/NumPy/Pandas), Linux/Bash/Core Utilities expertise, SQL, R (open-source), Matlab/Julia/Octave, Stata/SAS/SPSS (and other less broadly used languages).


General

5. What are your feelings about the mainstream explosion of the term “Big Data”?

Dr. Lewis is glad the term exists and that people are thinking about it, but he wants people to get real about it in the world of causal inference (“econinformatics”). For describing data (summary stats, etc.), “Big Data” has been great. For finding pockets of statistically informative and clean causality in data, “Big Data” has also been great. But “Big Data” in practice is more about arbitrarily precise correlations that tell us little about what most decision-makers care about: the causal effects.


Big Data Requires a New Kind of Expert: The Econinformatrician

Simple Models + Big Data = Econinformatics

Economic Theory + Big Data = Econinformatics

I recently had the opportunity to attend a conference held in honor of the great econometrician Dr. Jerry Hausman. The event, hosted by the Wang Yanan Institute for Studies in Economics (WISE) at Xiamen University, focused on recent developments in econometric theory with applications. One notable presentation was by Google’s Dr. Randall Lewis.

In his presentation, Dr. Lewis gave a practitioner’s overview of what it really means to work with Big Data. As a Google employee, he seemed to the audience uniquely qualified to discuss the day-to-day troubles that come with analyzing petabytes of data. He instead started with a story about his first day as a Yahoo intern in 2008.

He had a problem. He needed to open a 2GB text file and had no idea how. He tried Notepad — it didn’t work. He tried importing it into Matlab — his computer couldn’t handle it. He exhausted every method he had used in the past to view or load data. All failed, and several days passed before he finally managed to open the file.

Like most economics Ph.D.s, Dr. Lewis received an education focused almost entirely on theory; unlike most, his internship gave him the opportunity to battle against large text files. He learned simple technical skills he wasn’t learning in his doctoral program, which ultimately allowed him to do his dissertation using data inaccessible to his colleagues.

He later told me that he wishes more Ph.D. students could have the same opportunities that he did. He was lucky and benefitted greatly from the practical parts of his education — they helped him finish his Ph.D. in just four years.

Econometric theory is inarguably important, as one cannot do proper causal inference without it. And as Dr. Lewis explained, causal inference is what separates economists from “data scientists”. Unfortunately for the economist, econometric theory doesn’t explain how to open a large text file at the Unix command line.

Working with Big Data is cumbersome. Simple tasks, like opening or loading files, become complicated. Dr. Lewis explained that computer science and engineering students may finish school with the skills to deal with large data files, but many (if not most) students of economics do not. The simple models they learned to run as young econometricians also stop being so simple when performed on Big Data. “I just have to highlight that, in almost everything I do, it’s actually embarrassingly trivial, econometrically,” said Dr. Lewis during his talk. “I’m trying to work towards doing more advanced things, but you end up running into scalability constraints.”

Constraints are why the simple becomes difficult when doing economic analyses with Big Data. Hardware, computational power, time, funds, scalability, knowledge, etc. are all constrained, posing major challenges to the econometrician. When running a basic linear regression can cost you tens of thousands of dollars in electricity consumption, attempting a more computationally complex model just isn’t feasible.
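One common workaround, sketched below, is to compute a “simple” model out of core: for ordinary least squares, only the small cross-product matrices X'X and X'y need to be accumulated, so the raw data can be streamed through in chunks that never have to fit in memory at once. This is a generic illustration of the scalability point, not Dr. Lewis’s code.

    # Generic sketch (not Dr. Lewis's code): OLS computed one chunk at a time.
    # beta = (X'X)^(-1) X'y, with X'X and X'y accumulated across chunks, so the
    # full data set never has to sit in memory at once.
    import numpy as np

    def streaming_ols(chunks):
        """`chunks` yields (X, y) blocks; returns the OLS coefficient vector."""
        xtx, xty = None, None
        for X, y in chunks:
            if xtx is None:
                k = X.shape[1]
                xtx, xty = np.zeros((k, k)), np.zeros(k)
            xtx += X.T @ X        # k-by-k, tiny no matter how many rows stream by
            xty += X.T @ y
        return np.linalg.solve(xtx, xty)

    # Toy usage: three chunks of simulated data with true coefficients (2, -1).
    rng = np.random.default_rng(0)
    def make_chunks():
        for _ in range(3):
            X = rng.normal(size=(1_000, 2))
            yield X, X @ np.array([2.0, -1.0]) + rng.normal(size=1_000)

    print(streaming_ols(make_chunks()))   # close to [ 2. -1. ]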

If Big Data is the future of applied econometrics, then a strong background in econometric theory, while necessary, will no longer be sufficient for young econometricians looking to find work. They will also require the technical know-how to deal with terabytes of data. This combination is rare enough that Dr. Lewis has coined a term for what he does: “econinformatics.”

Half computer scientist, half economist. The Econinformatrician.

One final note. I wanted to highlight an interaction that occurred during Dr. Lewis’s presentation:

Dr. Lewis to the audience — “Who here, if I were to give you a 200GB gzipped file, could tell me how to read the first 3 lines of that file?”

One person (yours truly) raised their hand. There were roughly 110 people in the audience.

A professor from the audience — “With help from my RA, yes.”

[Audience laughs]

I’m sure Dr. Hausman will be fine without knowing how to read the first 3 lines of a 200 GB file. Personally, once I got back from China, I immediately began teaching myself more Unix command line and some SQL.
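For what it’s worth, one answer is to stream the file and stop after three lines rather than decompressing or loading the whole thing. At the shell, something like zcat file.gz | head -n 3 does it; a Python equivalent is sketched below (the filename is a placeholder).

    # Stream the gzipped file and stop after three lines; nothing close to
    # 200GB ever gets decompressed or loaded into memory.
    # Shell equivalent: zcat logs.gz | head -n 3
    import gzip

    with gzip.open("logs.gz", "rt") as f:   # "logs.gz" is a placeholder name
        for _ in range(3):
            print(f.readline(), end="")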
