Reading graphs: How we do it, and what it tells us about making better ones

Take a look at this graph showing population distribution by county in a fictional U.S. state:

[Figure: ratwani1.png]

How do you read such a graph? Is this the ideal way to depict this sort of information? If you wanted to know which part of the state was most populous, how would you go about figuring it out? Researchers have developed conflicting models to explain how it's done. One model suggests that people reading this kind of graph must cycle between the different parts in order to understand it. This makes some sense: to answer our question about population, you'd have to look back and forth between the legend and the colors on the map.

Another model says that how we read graphs like this depends on the question. We'd answer "what's the population of Knox County?" differently than "How is the population distributed across the state?" The first question just asks graph readers to extract information from the graph, while the second question demands that readers integrate information from the entire graph, building an understanding of relationships between its parts.

Integration is clearly the more complex task, and the one that graph-makers are probably most interested in. But how do we identify the relevant parts of the graph, and what else are we doing to integrate the information on a graph? A team led by Raj Ratwani showed viewers several different graphs like the example above, asking them to talk their way through answering a variety of questions about each graph.

As you might expect, their reasoning was indeed different based on the type of question being asked. For questions like "What is the population of X county?" most people just gave an answer, perhaps searching the graph for a bit before responding. For integrative questions, very few respondents could answer right away, and instead spent time rethinking their responses, getting more information, and building a pattern before offering a response.

In a second experiment, the researchers used an eye-tracking device to see where respondents were looking while they answered questions about the graphs. They also simplified the graphs by removing the county names and replacing them with single letters, like this:

[Figure: ratwani2.png]

So where were they looking? Once again, it depended on the type of question being asked. As you can see from the figure above, counties can be grouped into clusters based on population. These clusters are defined largely by their outer boundaries. So the researchers counted how frequently viewers looked at these outer boundaries versus the interior boundaries of counties within the same cluster. This graph shows the results:

[Figure: ratwani3.png]

The chart shows how frequently the viewers fixed their eyes on the outer boundaries of a cluster. As questions became more complex, a significantly larger portion of fixations were on the outer boundaries. Progressively fewer fixations focused on interior boundaries between counties.
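To make the measure concrete, here is a minimal sketch of how such a proportion could be computed. It assumes each fixation has already been labeled by where it landed; the labels and the sample sequences are made up for illustration and are not the study's data or its actual analysis code.

```python
from collections import Counter

def boundary_share(fixations):
    """Return the share of boundary fixations that land on outer cluster boundaries.

    `fixations` is a list of hypothetical labels: "outer", "interior", or
    anything else (legend, title, ...), which is ignored here.
    """
    counts = Counter(fixations)
    boundary_total = counts["outer"] + counts["interior"]
    if boundary_total == 0:
        return 0.0
    return counts["outer"] / boundary_total

# Made-up fixation sequences for a simple and a more integrative question.
simple_question = ["interior", "legend", "interior", "outer", "interior"]
integrative_question = ["outer", "outer", "legend", "interior", "outer"]

print(boundary_share(simple_question))       # 0.25
print(boundary_share(integrative_question))  # 0.75
```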

The researchers also analyzed the transitions between fixations -- not just where viewers were looking, but the paths they took to get there. Here are those results:

[Figure: ratwani4.png]

As graphs became more complex, viewers spent more time moving their eyes from cluster to cluster, and less time looking from a cluster to the legend.
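A transition analysis like this can be sketched in a few lines. The example below is only an illustration under my own assumptions: the region labels are hypothetical, every label other than "legend" is treated as a cluster, and the sample path is invented rather than drawn from the experiment.

```python
from collections import Counter

def transition_counts(regions):
    """Count transitions between consecutive fixations by type.

    `regions` is a hypothetical sequence of region labels such as
    "cluster_A", "cluster_B", or "legend".
    """
    counts = Counter()
    for previous, current in zip(regions, regions[1:]):
        if previous == current:
            continue  # staying within the same region is not a transition
        if "legend" in (previous, current):
            counts["cluster-legend"] += 1
        else:
            counts["cluster-cluster"] += 1
    return counts

# Invented gaze path across three clusters and the legend.
path = ["cluster_A", "legend", "cluster_A", "cluster_B",
        "cluster_C", "cluster_C", "legend"]
counts = transition_counts(path)
print(counts["cluster-cluster"], counts["cluster-legend"])  # 2 3
```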

Ratwani's team says this all suggests that viewers are using a more complicated process to read these graphs than previous research suggested. First they must integrate the graph visually -- that is, determine which cluster goes with which data. Then, they cognitively integrate -- figure out the relationship between the clusters.

The researchers offer a few suggestions on how to make better graphs based on their research:

Visual integration

  • Make the boundaries of clusters of data more obvious (for example, by making the lines between similar groups bolder)
  • Use color schemes that make the differences between groups clear: Use lots of different colors, not just shades of gray
  • Remove extraneous markings like the county names on the maps.

Cognitive integration

  • Make the relationship between the legend and the items on the graph obvious (for example, by using consistent colors throughout a paper or presentation)
  • Don't use too many different colors (decrease the total number of clusters)

To this, I'd like to add a few of my own. First of all, the tradition in journals and books of labeling figures with numbers and placing them far away from their descriptions in the text should be scrapped. I'd rather see a partial page or even a blank page in a book if it meant that the figure actually appeared next to its description in the text. Online journals should place figures inline, not make you click to view them separately.

A related quibble: Books and journals have a tradition of *never* placing figures before their textual description, even if the textual description would appear on the same page (or two-page spread) as the figure. Once again, there's no reason for this. It's much better for a figure to appear a half-page before the textual reference than five pages after it.

Finally, figures should be clearly marked. Don't use abbreviations or shortcuts in the legend. Say what you mean! Researchers often use abbreviations as shortcuts while they're doing preliminary data analysis in the lab. This doesn't mean you have to use those same shortcuts when reporting your data to the public.

Raj M. Ratwani, J. Gregory Trafton, Deborah A. Boehm-Davis (2008). Thinking graphically: Connecting vision and cognition during graph comprehension. Journal of Experimental Psychology: Applied, 14 (1), 36-49 DOI: 10.1037/1076-898X.14.1.36


At least you haven't shown an example of the most egregious graph evilness--which is ubiquitous in financial/economic reporting--of the displaced axis. The best example of this is the daily stock market numbers, where the y-axis starts at 8000 and tops off at 9000, making it look visually like a 100-point excursion is ZOMFG!!!! 10%!!!!!!!! change, when it is really about 1%.

I despise that shit.
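The arithmetic behind that complaint can be checked with a short sketch. The function name is made up, and measuring the "real" change against the midpoint of the axis range is an assumption chosen for illustration.

```python
def axis_exaggeration(change, axis_min, axis_max):
    """Compare how large a change looks on a truncated axis with how large it is.

    Returns (apparent, actual): the fraction of the plotted axis height the
    change spans, and the change relative to the midpoint value of the axis.
    """
    apparent = change / (axis_max - axis_min)
    actual = change / ((axis_min + axis_max) / 2)
    return apparent, actual

apparent, actual = axis_exaggeration(100, 8000, 9000)
print(f"looks like {apparent:.1%}, is really {actual:.1%}")
# looks like 10.0%, is really 1.2%
```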

I guess the first question to ask before representing data is, "What questions are likely to be asked, where this data is to be used to inform an answer?" Examples might be, where in the state would be good places to permit new or expanded hog barns or other intensive livestock operations, where to locate a new regional airport, primary trauma center at an existing hospital, encroachment of suburbs into agricultural land. These charts are useful but not as useful as they could be.
A better way (for this case) might be to maintain the color scheme but represent each county's population by an area within the county outline showing how big it would have to be to contain all the county's population at the highest county's population density.
A county with 9,000 people per square mile would be about 20% red; a county with 500 people per square mile would be about 1% red, and a county with 11,000 people per square mile would be maybe 22% orange.
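As a rough sketch of that idea, the filled share of each county is just its density divided by the highest density. The comment doesn't state the highest density, so the 50,000 people per square mile used below is purely an assumption picked to land near the quoted percentages.

```python
def filled_fraction(density, max_density):
    """Fraction of a county's outline to fill so its people sit at max_density."""
    if max_density <= 0:
        raise ValueError("max_density must be positive")
    return min(density / max_density, 1.0)

MAX_DENSITY = 50_000  # assumed value; not given in the comment

for density in (9_000, 500, 11_000):
    print(density, f"{filled_fraction(density, MAX_DENSITY):.0%}")
# 9000 18%
# 500 1%
# 11000 22%
```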

Physio: Don't forget the red state/blue state graphs where Montana and Wyoming are as big and brightly colored as California and New York.

Isn't the first graph screaming out to have the state size strictly proportional to population? The red counties seem to overwhelm it, yet they're the smallest band.

Using shades of gray is better than using many colors, because shades of gray can be a scale, whereas colors can't. In the population graph shown, you wouldn't need to look back and forth between the graph and the legend more than once if white indicated the smallest population density, and darker grays indicated progressively higher densities.

I do agree with Phil, but also feel that color can just as effectively be used if it's within an easily understood "scheme." For example, going from yellow, to orange, to red to indicate increasing values. (I found the color choices of the graph above to be confusing, as red indicated the lowest amounts, and a seemingly random blue for the upper limits.)

If not shades of gray, shades of the same color would work equally well; it could even be used to clearly indicate varying degrees of more than one variable on the same graph. Just use a different color scale per variable!
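For anyone who wants to try the single-scale idea, here is a minimal sketch that maps a value onto a white-to-black ramp. The function name and the linear mapping are assumptions for illustration; a real map might bin values or use a perceptually tuned ramp instead.

```python
def gray_for(value, low, high):
    """Map a value to a hex gray: white at `low`, black at `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    t = (value - low) / (high - low)
    t = max(0.0, min(1.0, t))  # clamp out-of-range values
    level = round(255 * (1 - t))
    return f"#{level:02x}{level:02x}{level:02x}"

print(gray_for(0, 0, 1000))     # #ffffff
print(gray_for(1000, 0, 1000))  # #000000
```

Swapping the single gray level for a fixed hue with varying lightness would give the "shades of the same color" version, one ramp per variable.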

That's "graph" meaning an abbreviation for "graphic" eh?
Would calling it an "illo" (for "illustration") help?

By Hank Roberts on 29 Jan 2009

Hank:

No, it's "graph" meaning a visual representation of data. Were you thinking a "graph" could only represent a mathematical function?

Interestingly, in the printed journal article, those "map graphs" are shown using shades of gray, with the darkest gray corresponding to the highest population (as recommended in comments 4 & 5). I only skimmed the article to see the graphs, so I don't know if that's what they used in the experiment or not.

Cartographers have researched and worked out rules of thumb for displaying data, and Edward Tufte has also discussed some of the issues thoughtfully. It's worth spending some time on as there is more than a little art to it and your approach will vary with the data and what you are trying to communicate. In general while a simpler image may be easier to understand, sometimes you have to produce a graphic that repays deeper study.

For starters, data may be discrete or have implied continuity. Color ramps can be used to show relationships but should be used with care since a portion of your audience will be color blind. Artists generally try to choose schemes that will read reasonably well in color, in black and white, and for the colorblind (3 different issues). There are on-line services that let you submit an image and simulate how it looks to a color blind person (who will tend to see the relationships differently).

Color and even grayscale are tricky anyway because how they are perceived is context dependent, and perception of steps isn't linear in any case. Color in particular has some unusual properties as well. For instance, areas of color don't define form as well as hard black lines. More alarming, the colors in the examples given are harsh and competing with each other for attention which makes reading difficult.

By Radge Havers on 29 Jan 2009
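One of those three issues, how a scheme reads in black and white, can be roughly checked by hand. The sketch below uses the standard sRGB relative-luminance formula to test whether a list of colors gets steadily darker; the function names and the example palettes are my own, and this says nothing about how the colors look to a color blind viewer, which needs a proper simulation.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB color like '#ffcc00'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before weighting the channels.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def reads_as_ramp(colors):
    """True if luminance strictly decreases along the list (light to dark)."""
    values = [relative_luminance(c) for c in colors]
    return all(a > b for a, b in zip(values, values[1:]))

# Yellow, orange, red: each step is darker than the last.
print(reads_as_ramp(["#ffff00", "#ff8000", "#ff0000"]))             # True
# Red, orange, yellow, blue: lightness goes up, then drops.
print(reads_as_ramp(["#ff0000", "#ffa500", "#ffff00", "#0000ff"]))  # False
```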

Single tone color schemes, grey or a single hue, work well for indicating spectra of values. They also are easy for the 10% of the male population that is color-blind.

I disagree about putting the figure before its description in the text.

The reader needs to know what to look for before tackling the figure. It's the author's job to explain the results, not the reader's to pore over the illustration, trying to figure out what its point is.

The author should guide the reader with a statement along the lines of "The population of Big State is highly concentrated in the counties immediately surrounding Big City, as seen in Figure X", if this is indeed the idea (s)he is trying to get across.

Plunking an unexplained figure in front of an unprepared reader is a failure of the author's responsibility. I spend a lot of my pedagogic effort trying to teach students not to simply paste their illustrations into their paper and think they've produced a results section.

Anyone interested in presenting their data in graphical form should read Tufte's "The Visual Display of Quantitative Information" for how tos, how not tos, good examples, bad examples and history (Playfair et al.).

In a world where most people are going to use the default settings in whatever "graphing" software they have (mostly Excel), the real decisions that affect users' understanding were made decades ago by a programmer for reasons that had little to do with how people process visual data.

In my field, 90% of the graphics I see don't even require a legend, but they all have them because Excel puts one in by default.

Tufte is brilliant, and one of his most worrying observations is that we've been captured by the software tools. Read his bit on the role that PowerPoint played in the lead-up to the Columbia disaster. He's spot on!

The graphic would be more intuitive for me if the densest populations were red and less dense orange, yellow, and blue. Intensity = crowding = heat. That kind of automatic "physical" interpretation helps me.

Well, I read the article and then everyone's comments with interest -- and some pain of course at the obvious flaws we all can see with the population density map.

The issues pointed out -- the flat color scheme, the uninformative legend, etc, etc -- are precisely on my radar on a daily basis; I work for UUorld (before you ask, it's pronounced "World"!) Inc., a small mapping and data visualization company based in San Francisco. It is in attempting to create software that by its very nature repels those flaws that I have become so aware of them in general in the world of visualizing data.

It intrigues me deeply that a study in cognitive science has been done to actually measure, in effect, the extent to which one must think to glean information from a map. Maybe the results are obvious -- that poorly organized maps are less useful than good ones. But in any event to measure brain activity has remarkable potential to validate the sort of work our company (and others) do.

I thought it would be interesting to approach the findings of Mr. Ratwani's study firsthand, so I created some real-world sample maps using counties in Maryland and created a series of graphs to improve upon the trouble spots exposed by the study (and this thread). I posted the results here, along with explanations of what I did: http://www.uuorld.com/blog

I especially appreciated BobS's comment that most people are yoked by the default settings of whatever tool they use to create (or even to analyze) graphs ... I would like to think our software is a step in the right direction, towards rich and immersive visualizations that tell a story ... not spawn further confusion. Anyone is welcome to a free download at our website.

Seems strange to me that the people doing this study don't make a recommendation to integrate the legend with the graphs, rather than have them way over there on the side.

If the eye is bolting back and forth so much, maybe the information way over on the right should be placed right where the eye is fixating in the colored regions.

Same idea could help those bar graphs. I found them quite confusing (in their terminology and the placement of relevant information).