I recently had the opportunity to attend a conference held in honor of the great econometrician Dr. Jerry Hausman. The event, hosted by the Wang Yanan Institute for Studies in Economics (WISE) at Xiamen University, focused on recent developments in econometric theory with applications. One notable presentation was given by Google's Dr. Randall Lewis.
In his presentation, Dr. Lewis gave a practitioner's overview of what it really means to work with Big Data. As a Google employee, he seemed to the audience uniquely qualified to discuss the day-to-day troubles that come with analyzing petabytes of data. He instead started with a story about his first day as a Yahoo intern in 2008.
He had a problem. He needed to open a 2GB text file and had no idea how. He tried Notepad; it didn't work. He tried importing it into Matlab; his computer couldn't handle it. He exhausted every method he had used in the past to view or load data. All failed, and several days passed before he finally managed to open the file.
Like most economics Ph.D.s, Dr. Lewis’ education focused almost entirely on theory, but unlike most economics Ph.D.s, his internship gave him the opportunity to battle against large text files. He learned simple technical skills he wasn’t learning in his doctoral program, which ultimately allowed him to do his dissertation using data inaccessible to his colleagues.
He later told me that he wishes more Ph.D. students could have the same opportunities that he did. He was lucky and benefitted greatly from the practical parts of his education — they helped him finish his Ph.D. in just four years.
Econometric theory is inarguably important, as one cannot do proper causal inference without it. And as Dr. Lewis explained, causal inference is what separates economists from “data scientists”. Unfortunately for the economist, econometric theory doesn’t explain how to open a large text file via Unix command line.
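As it happens, the tools for this are standard and simple. A minimal sketch (the file name `big.txt` is hypothetical) of inspecting a multi-gigabyte text file from the Unix command line without ever loading it into memory:

```shell
# Print just the first 10 lines; head stops reading after that
head -n 10 big.txt

# Count the lines by streaming through the file once
wc -l big.txt

# Page through interactively; less reads from disk on demand
less big.txt
```

Each of these streams the file rather than loading it whole, which is why they work where Notepad and Matlab fail.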
Working with Big Data is cumbersome. Simple tasks, like opening or loading files, become complicated. Dr. Lewis explained that computer science and engineering students may finish school with the skills to deal with large data files, but many (if not most) students of economics do not. The simple models they learned to run as young econometricians also stop being so simple when performed on Big Data. “I just have to highlight that, in almost everything I do, it’s actually embarrassingly trivial, econometrically,” said Dr. Lewis during his talk. “I’m trying to work towards doing more advanced things, but you end up running into scalability constraints.”
Constraints are why the simple becomes difficult when doing economic analyses with Big Data. Hardware, computational power, time, funds, scalability, and knowledge are all constrained, posing major challenges to the econometrician. When running a basic linear regression can cost tens of thousands of dollars in electricity, attempting a more computationally complex model just isn't feasible.
If Big Data is the future of applied econometrics, then a strong background in econometric theory, while necessary, will no longer be sufficient for young econometricians looking to find work. They will also require the technical know-how to deal with terabytes of data. This combination is rare enough that Dr. Lewis has coined a term for what he does: “Econinformatics.”
Half computer scientist, half economist. The Econinformatrician.
One final note. I wanted to highlight an interaction that occurred during Dr. Lewis’s presentation:
Dr. Lewis to the audience — “Who here, if I were to give you a 200GB gzipped file, could tell me how to read the first 3 lines of that file?”
One person (yours truly) raised a hand. There were roughly 110 people in the audience.
A professor from the audience — “With help from my RA, yes.”
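For the curious, one standard answer to Dr. Lewis's question (assuming the file is named `data.gz`) is a one-line pipeline. Because gzip decompresses as a stream and `head` stops after three lines, nothing close to 200GB is ever decompressed or held in memory:

```shell
# Decompress to stdout and stop after the first 3 lines
zcat data.gz | head -n 3

# Equivalent form for systems where zcat expects .Z files (e.g. macOS)
gzip -dc data.gz | head -n 3
```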
I'm sure Dr. Hausman will be fine without knowing how to read the first 3 lines of a 200GB file. Personally, once I got back from China, I immediately began teaching myself more of the Unix command line and some SQL.