why you should take your ngrams with a grain of salt

By bioephemera on December 28, 2010.

Because people have been discussing Google ngrams a lot, and because there are always major caveats to new data-mining methodologies, I have to link to Natalie Binder's excellent series of posts urging caution, not only about the methodology but also about assuming too much about ngrams' utility in social research.

Binder says,

The value of the Ngrams Viewer rests on a bold conceit: that the number of times a word is used at certain periods of time has some kind of relationship to the culture of the time. For example, the fact that the word "slavery" peaks around 1860 suggests that people in 1860 had a lot to say about slavery. Another spike around the 1970s meshes nicely with the Civil Rights Movement.

Well, that's sort of interesting. However, I didn't need ngrams to tell me that a lot of people were writing about slavery in 1860. These data are broad but not deep, which makes them relatively useless to most humanities majors interested in intensive study. To understand the futility of trying to understand history this way, pretend that you've never heard of slavery, the Civil War or civil rights. Now take another look at the chart above. If Ngrams was your first encounter with the word "slavery," could you deduce that Americans owned slaves in the 1860s? Could you say anything other than, "slavery was a pretty big deal back then"? Probably not. But that is what the Google-Harvard team is suggesting we attempt to do, not necessarily with "slavery," but with many other words and ideas.

Binder is, as you can tell, extremely skeptical of ngrams' potential for analyzing trends in literature, even once the various OCR and metadata issues that produce false positives are cleaned up. Maybe she's a bit too skeptical. I have some sympathy for the position that we just don't need more heaps of imperfectly mined data. The internet and related technologies have already given us orders of magnitude more data than we have PhDs to dissertate it. And since I don't think graduate students should be treated as mere instrumentalities to embiggen our national knowledge base, I think we should train fewer PhDs, not more.

On the other hand, the relevant questions are how to ensure the data being mined is high-quality, how to filter out systematic errors, and how to devise questions that maximize the strengths of the data rather than just flailing at it with naive curiosity. And those, really, have always been the important questions about large datasets. They're the same questions I asked when I was 19 years old, counting several thousand mutant fruit flies, and realized halfway through that I'd scored a poorly penetrant allele of achaete as forked (or something like that) and had to start all over. The data is only as good as the filter - person or technology - reading it.

So it's not enough to have piles of data, as intoxicating as the prospect may be. You have to know the data contains what you're looking for, and you have to figure out how to work around its weaknesses (which means knowing what those weaknesses are). It may be obvious that something is haywire when the word "internet" spikes in the 1920s, but it won't be so obvious with most artifacts. Nobody said science was easy... and slapping a "Google" on it certainly doesn't make it so.
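To make the "internet in the 1920s" case concrete, here is a minimal Python sketch of the crudest possible filter: flag any year in which a word shows up before it could plausibly have existed. Everything in it is an assumption for illustration. The function name `flag_anachronisms`, the 1970 cutoff, and the counts themselves are invented, not taken from the Ngram Viewer or its data.

```python
# Toy sketch: flag suspicious early hits for a word in per-year count data.
# All numbers below are invented for illustration; they are not real ngram data.

def flag_anachronisms(counts_by_year, earliest_plausible_year):
    """Return the years with nonzero counts before the word could plausibly exist."""
    return sorted(
        year
        for year, count in counts_by_year.items()
        if count > 0 and year < earliest_plausible_year
    )

# Hypothetical counts for "internet"; the 1920s hits stand in for
# OCR or metadata errors. The 1970 cutoff is also an assumption.
toy_counts = {1905: 0, 1923: 14, 1927: 9, 1950: 0, 1985: 40, 1995: 5200}

suspects = flag_anachronisms(toy_counts, earliest_plausible_year=1970)
print(suspects)  # [1923, 1927]
```

Note what the sketch cannot do: it only catches errors when you already know the cutoff year. For most words there is no such cutoff, which is exactly why most artifacts won't be obvious.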
