why you should take your ngrams with a grain of salt

Because people have been discussing Google ngrams a lot, and because there are always major caveats to new data-mining methodologies, I have to link to Natalie Binder's excellent series of posts urging caution, not only about the methodology, but about assuming too much about ngrams' utility in social research.

Binder says,

The value of the Ngrams Viewer rests on a bold conceit: that the number of times a word is used at certain periods of time has some kind of relationship to the culture of the time. For example, the fact that the word "slavery" peaks around 1860 suggests that people in 1860 had a lot to say about slavery. Another spike around the 1970s meshes nicely with the Civil Rights Movement.

Well, that's sort of interesting. However, I didn't need ngrams to tell me that a lot of people were writing about slavery in 1860. These data are broad but not deep, which makes them relatively useless to most humanities majors interested in intensive study. To understand the futility of trying to understand history this way, pretend that you've never heard of slavery, the Civil War or civil rights. Now take another look at the chart above. If Ngrams was your first encounter with the word "slavery," could you deduce that Americans owned slaves in the 1860s? Could you say anything other than, "slavery was a pretty big deal back then"? Probably not. But that is what the Google-Harvard team is suggesting we attempt to do, not necessarily with "slavery," but with many other words and ideas.

Binder is, as you can tell, extremely skeptical of ngrams' potential for analyzing trends in literature, even once the various OCR and metadata issues that produce false positives are cleaned up. Maybe she's a bit too skeptical. I have some sympathy for the position that we just don't need more heaps of imperfectly mined data. The internet and related technologies have already given us orders of magnitude more data than we have PhDs to dissertate it. And since I don't think graduate students should be treated as mere instrumentalities to embiggen our national knowledge base, I think we should train fewer PhDs, not more.

On the other hand, the relevant questions are how to ensure the data being mined is high-quality, how to filter out systematic errors, and how to devise questions that maximize the strengths of the data rather than just flailing at it with naive curiosity. And those, really, have always been the important questions about large datasets. They're the same questions I asked when I was 19 years old, counting several thousand mutant fruit flies, and realized halfway through that I'd scored a poorly penetrant allele of achaete as forked (or something like that) and had to start all over. The data is only as good as the filter - person or technology - reading it.

So it's not enough to have piles of data, as intoxicating as the prospect may be. You have to know that the data contains what you're looking for, and figure out how to work around its weaknesses (which means knowing what those weaknesses are). It may be obvious that something is haywire when the word "internet" spikes in the 1920s, but it won't be so obvious with most artifacts. Nobody said science was easy... and slapping a "Google" on it certainly doesn't make it so.
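To make the "internet in the 1920s" point concrete, here is a minimal sketch of the kind of crude sanity filter one might run over ngram-style frequency data. This is my own illustration, not anything from Binder or the Google-Harvard team; the series values, the term, and the cutoff table are invented for the example.

```python
# A minimal sketch of one kind of sanity filter: flagging obviously
# anachronistic hits in an ngram-style frequency series. The data and
# the "earliest plausible year" table below are hypothetical.

from typing import Dict, List

# Hypothetical per-year relative frequencies for a term (year -> frequency).
internet_series: Dict[int, float] = {
    1920: 0.0000002,   # suspicious: likely OCR error or bad publication metadata
    1925: 0.0000001,
    1990: 0.0000300,
    2000: 0.0004100,
}

# Rough, hand-specified floors: the term should not appear before this year.
EARLIEST_PLAUSIBLE_YEAR = {"internet": 1970}

def flag_anachronisms(term: str, series: Dict[int, float]) -> List[int]:
    """Return years with nonzero frequency before the term could plausibly exist."""
    floor = EARLIEST_PLAUSIBLE_YEAR.get(term)
    if floor is None:
        return []
    return sorted(year for year, freq in series.items() if freq > 0 and year < floor)

print(flag_anachronisms("internet", internet_series))  # -> [1920, 1925]
```

A check this blunt only catches the glaring cases; most artifacts won't announce themselves by appearing decades before the thing they name, which is exactly the problem.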
