I'll bet you don't understand error bars (updated with answers)

Cognitive Daily gets a lot of complaints about graphs, mostly from readers who say the graphs are useless without error bars. My response is that error bars are confusing to most readers. But perhaps I'm wrong about that. Now I'm going to put my money where my mouth is.

Take a look at this graph. It represents a fictional experiment where two different groups of 50 people took a memory test. The mean scores of each group are shown, along with error bars showing standard error:

[Graph 1: mean memory test scores for the two groups of 50, with error bars showing standard error]

Based on this graph, can you tell if there is a significant difference (p<.05) between the scores of the two groups? Let's make this a poll. For the sake of accuracy, please respond as best you can, even if you don't know what error bars represent.

Below I've included a similar graph, again testing two different groups of 50 people but using a different type of error bar:

[Graph 2: mean memory test scores for two new groups of 50; its caption identifies the error bars as 95% confidence intervals]

Again, based on this graph, can you tell if there is a significant difference (p<.05) between the scores of the two groups?

I'll give the correct answers later today (after plenty of folks have had a chance to respond), but I'll wager now that we will get a large number of incorrect responses to each poll, even though many of our readers are active researchers.

The Bet
Here's my wager. I say that fewer than 50 percent of our readers can accurately answer the poll questions without cheating or looking up the information elsewhere first. If we get more than 300 responses to each poll, and accuracy is better than 50 percent for each, then I'll add error bars to every graph I produce for Cognitive Daily from here on out (as long as the researchers publish enough information for me to generate them -- and as long as the error bars are statistically relevant [more on that later]). If not, then I get to link to this post every time a commenter complains about Cognitive Daily not putting error bars in its graphs.

Update: Okay, I think we've now gotten enough answers to demonstrate that most of our readers don't understand error bars. I win! (I probably shouldn't be too happy about that though...)

I'll post the answers below -- you'll need to scroll down to see them so people can still answer the poll without seeing the answer first.

Now, the answers. For Graph 1, the correct response is "too close to call." Since the error bars are standard errors, they need to be separated by at least half the length of the error bars. I'll give partial credit for a "yes" answer, since they are actually separated by exactly half the length of the error bars. (For more explanation, see this post)

For Graph 2, the correct response is "yes." Since the error bars are 95% confidence intervals, they can overlap by as much as 25% of the length of the error bars and still show a significant difference. These error bars just barely overlap, so the difference is definitely significant.
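For readers who want to check the two rules themselves, here is a rough sketch in Python. The means, standard errors, and group sizes below are stand-ins, not the exact values from the graphs above; swap in your own readings.

    # A rough sketch (not the exact values from the graphs above): given two
    # group means, standard errors, and sizes, compare an ordinary two-sample
    # t-test with the two graphical rules of thumb described above.
    import math
    from scipy import stats

    def two_group_summary(m1, se1, n1, m2, se2, n2):
        diff = abs(m1 - m2)
        se_diff = math.sqrt(se1**2 + se2**2)        # SE of the difference
        df = n1 + n2 - 2
        t = diff / se_diff
        p = 2 * stats.t.sf(t, df)                   # two-tailed p-value
        # Rule of thumb 1: SE bars (each spans +/- 1 SE around its mean).
        gap_between_se_bars = diff - (se1 + se2)
        # Rule of thumb 2: 95% CI bars (each spans +/- t_crit * SE).
        crit = stats.t.ppf(0.975, df)
        overlap_of_ci_bars = crit * (se1 + se2) - diff
        return t, p, gap_between_se_bars, overlap_of_ci_bars

    # Stand-in values in the spirit of the graphs: two groups of 50.
    t, p, gap, overlap = two_group_summary(75, 8, 50, 50, 8, 50)
    print(f"t = {t:.2f}, p = {p:.3f}")
    print(f"gap between SE bars:     {gap:.1f} score units")
    print(f"overlap of 95% CI bars:  {overlap:.1f} score units")

The second check only makes sense when the plotted bars really are 95% confidence intervals, as in Graph 2.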

Any comments on the polls? I tried to make the instructions as clear as possible, but I'm open to hearing any claims as to how my test may have slanted the results.

Comments

I saw you palm that card. Your readers complain that bar graphs should have error bars, and you claim they can't interpret error bars. You demonstrate this by showing that they can't interpret standard error. That's not the same thing as "error bars."

More seriously, even if you are right that not being able to interpret the standard errors means you shouldn't graphically present them, you've also shown that they can't interpret the means, so you shouldn't be presenting any graphs or numbers at all!

Nobody owes it to you to accept your bar graphs at face value just because you can't prove they mean anything real.

I know absolutely nothing about error bars, however, if you were to include a brief explanation (similar to the ones you included at the bottom of this post), I would appreciate you including them in future posts.

The explanations "Since the error bars are standard errors, they need to be separated by at least half the length of the error bars." and "Since the error bars are 95% confidence intervals, they can overlap by as much as 25% of the length of the error bars and still show a significant difference." were very helpful to me.

Hi, biostatistician here. I agree that most people don't really understand error bars completely, but I also don't think your demonstration is particularly valuable. Among the uses of error bars, the very few special cases in which the display can be used to conduct an ad hoc significance test are probably the least important. To me, the whole point of interval estimation is that it is more informative than simple point estimation, with or without a p-value. Psychologists are far too reliant on .05 significance tests!

I think a better experiment would have been to display two identical bar charts each comparing two groups of vastly unequal size, one with error bars, one without. The chart with error bars, of course, would compellingly demonstrate the relative lack of precision in the point estimate for the smaller group versus the larger group, even for folks who may not completely understand the relationship between standard error and hypothesis tests. The poll then could have been: "Which of these two charts is more informative?" I know what my answer would have been! (We'll put aside for the moment the fact that bar charts are an extremely wasteful way of displaying, essentially, two numbers to begin with.)
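To picture that demonstration, here is a rough matplotlib sketch; all the numbers (means, SDs, group sizes) are invented for illustration.

    # A rough sketch of the demonstration described above: the same two means
    # plotted once without and once with standard-error bars, for two groups
    # of vastly unequal size. All numbers here are invented for illustration.
    import numpy as np
    import matplotlib.pyplot as plt

    means = [72, 68]
    sds = [15, 15]
    ns = [500, 10]                                      # vastly unequal sizes
    ses = [sd / np.sqrt(n) for sd, n in zip(sds, ns)]   # SE = SD / sqrt(n)

    fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
    axes[0].bar(["Group A", "Group B"], means)
    axes[0].set_title("Without error bars")
    axes[0].set_ylabel("Mean score")
    axes[1].bar(["Group A", "Group B"], means, yerr=ses, capsize=8)
    axes[1].set_title("With standard-error bars")
    plt.tight_layout()
    plt.show()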

Also (I know, this is a bit pedantic), both of your significance conclusions rely completely on unstated distributional assumptions.

I think this demo misses the point of displaying error bars. Error bars provide more than a simple visual display of significance. Most people trust, for better or worse, that the statistical analyses were done properly in peer-reviewed research, and will accept the reported p less than .00x at face value.

Instead, what *I* get out of standard-error error bars is visual hint at the variance in each group. If there's a ton of variance around a variable, it makes me wonder if the experiment was well-designed, or if there are ways to improve it to get more reliable measurements. If the size of the error bar is dramatically different between two groups, I wonder if outliers can account for the statistical between-group differences, and I wonder if there was appropriate homogeneity within each group. And so on.

Error bars basically give another view of the raw data, and they're essentially free to display, so who really cares if people don't understand exactly how they relate to your p-values? That's not the point. Presumably, you've already told me your p-values, anyway, right?

Hey, I answered Yes to both! So what do I win?

I look at the graphs and just "guessed" that the difference in the mean scores was "big enough" that the error bars would have no effect on the answer.

I remember looking at some graphs where the error bars were large for every data point (they were measuring gamma-ray bursts or novae), but the pattern of all the data points was (relatively) linear. It showed that the data fit the line, even if the result felt somewhat forced; it still seemed like the right relationship even with the error ranges. In cases like that, the uncertainty of the measurements is of great interest.

I wonder how many people would have got the correct results if you hadn't shown any error bars.

Great experiment, and I have learned something. Now I challenge you not to dumb things down here: go forward with error bars, and build in links that explain them for us.

In re (1): I don't think it's too close to call. I read from your graph

x1 = 75
x2 = 50
se1 = 8
se2 = 8

and you state N1=N2=50. Hence nu=98, so the t-test is essentially the Z-test. But using the TDIST function in Excel, with se=SQRT(se1^2+se2^2), I get p=0.029, which is less than your alpha value, and it is not too close to call.
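The same calculation can be rerun outside Excel; this sketch uses the same numbers read off the graph, which may not match the figure exactly.

    # Re-running the calculation above without Excel's TDIST; the inputs are
    # the values read off the graph, which may not match it exactly.
    import math
    from scipy import stats

    x1, x2 = 75, 50
    se1 = se2 = 8
    n1 = n2 = 50

    se_diff = math.sqrt(se1**2 + se2**2)    # SE of the difference of means
    t_stat = (x1 - x2) / se_diff
    df = n1 + n2 - 2                        # nu = 98
    p = 2 * stats.t.sf(t_stat, df)          # two-tailed p-value
    print(f"t = {t_stat:.2f}, df = {df}, p = {p:.3f}")   # p comes out near 0.029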

In re (2), I'll bet a significant fraction of your polled population did what I did: I scrolled by clicking the thumb bar, read your description, pulled the numbers from the graph, and found the p~20%>5% and voted no. What I did not do (and admittedly should have done) is read the caption (which did not appear on the same pageview as the graph on my screen), which is the only place you mention that the bars are 95% confidence intervals. What you have shown is either (a) you don't know how to design a web-based poll, or (b) people like me don't read captions for stuff like this. What you have not shown is that I don't know what a 95% confidence interval is. What you have especially not shown is that showing standard error is pointless.

Regards
BBB

BBB,

In regards to the first question, I seriously doubt more than a fraction of participants went through the calculations you did. I'm relying on the Cumming and Finch article in the Feb-March 2005 American Psychologist for my interpretation of the answer, and they make a compelling case for the rule of thumb that a gap of 1/2 of the width of an SE bar corresponds to p=.05.
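For anyone trying to see where that rule of thumb comes from, here is a sketch of the arithmetic, assuming two independent groups of equal size with equal standard errors, and reading "the width of an SE bar" as the full span of 2 SE (1 SE in each direction):

    % Sketch, under the stated assumptions, of how the gap between SE bars
    % relates to the two-sample t statistic.
    \[
      t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\mathrm{SE}^2 + \mathrm{SE}^2}}
        = \frac{\text{gap} + 2\,\mathrm{SE}}{\sqrt{2}\,\mathrm{SE}},
      \qquad
      \text{gap} = |\bar{x}_1 - \bar{x}_2| - 2\,\mathrm{SE}.
    \]
    \[
      t = 1.96 \;\Longrightarrow\; \text{gap} \approx 0.8\,\mathrm{SE};
      \qquad
      \text{gap} = 1\,\mathrm{SE} \;\Longrightarrow\; t = \tfrac{3}{\sqrt{2}} \approx 2.12,\; p \approx .04.
    \]

Under this reading, a gap of about half the full bar width lands right around the .05 borderline.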

For the second question, I specifically state in the text that these error bars are different from the first question, and the caption indicates what they are. This is more information than you'll get in most journal articles, which only explain what type of error bars are used in figure captions.

If people aren't reading captions to understand what error bars are being used, then clearly they aren't really understanding error bars, since 95 percent confidence intervals are generally about twice as wide as SE bars -- which was my point from the outset.
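A quick numeric check of the "about twice as wide" point; the SE and group size here are just illustrative.

    # Quick check that a 95% CI half-width is roughly twice the SE for
    # moderately large samples; the SE and group size are illustrative.
    from scipy import stats

    se = 8
    df = 50 - 1                             # one group of 50
    crit = stats.t.ppf(0.975, df)           # about 2.01 for df = 49
    print(f"SE bar half-width:   {se}")
    print(f"95% CI half-width:   {crit * se:.1f}  (ratio about {crit:.2f})")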

Dear Dave,

I am a computational physicist and a professor of engineering. I strongly recommend against the use of 95% confidence intervals for reasons related to your post. I believe it is too easy to confuse them with standard error bars unless you read the fine print, and there is a reason the latter are called "standard" ;-) I also require all my students to compute and to display standard error bars on their data. As another physicist friend of mine has said, "If you don't know the uncertainty in your data, you really haven't shown anything."

As far as why I didn't read the caption, all I can say is that it didn't display when I scrolled to that part of the quiz -- it was at the top of the next screenful of information. I was puzzled by the reference to "a different type of error bar" but "different" could have meant "overlapping". It frankly never occurred to me that you could have meant "confidence interval" when you wrote "error bar" because confidence intervals and error bars are two different things.

Regards
BBB

BBB,

So you're saying that researchers shouldn't use confidence intervals because people frequently confuse them with standard error?

That's pretty similar to my original reason for not reporting standard error or confidence interval.

That said, I'm not a statistician, but I do know that "error bar" refers to more than just standard error of the mean. For example, you could also use it to depict measurement error, when known.

As the Belia et al. study shows, researchers make mistakes interpreting both confidence intervals and error bars, so it's difficult to use "confusion" as a reason for reporting one or the other type. Cumming and Finch, in the article I mention above, recommend confidence intervals, because they have a direct relation to the p-value for a data point.

>>So you're saying that researchers shouldn't use confidence intervals because people frequently confuse them with standard error?<<

I apologize for the misunderstanding -- as I added in the Compuserve thread that is mirroring this conversation, I should instead have said "I strongly recommend against the use of 95% confidence intervals in data plots..." That is, showing a confidence interval as an error bar in a data plot is inherently misleading, as is the indefensible practice of doubling the error bar because "2" is close to the 2-tail Z-test alpha=5% Z value of 1.96. But certainly one should discuss confidence intervals in the appropriate setting. Here is a great example of a paper that deploys an appropriate discussion of errors, confidence intervals, and chi-squared significance levels:

http://www.arxiv.org/abs/cond-mat/9910291

In particular I refer you to the discussion on page 3, where we discuss in alarming detail the validity of spin-wave theory.

Note, though, that it is unnecessary to mention in the plots in that paper that the error bars are standard errors. I respectfully disagree with Cumming and Finch; theirs is a distinctly minority view.

In circumstances where more than one kind of uncertainty is being reported, and where a calculation of such can be made, it is customary to address these uncertainties in the text, e.g. "the value of x is predicted to be 1.6932(43)(32) where the first uncertainty is systematic error and the second is the standard error of the mean of the observations." In the accompanying plot, I would then show SQRT(0.0043^2+0.0032^2) as the error bar. I would NOT convert the error to a 95% confidence interval and plot that.
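Concretely, that quadrature combination works out as follows, using the example numbers quoted above.

    # Quadrature combination of the systematic uncertainty and the standard
    # error of the mean from the example value 1.6932(43)(32) quoted above.
    import math

    systematic = 0.0043
    statistical = 0.0032
    combined = math.sqrt(systematic**2 + statistical**2)
    print(f"combined error bar: +/- {combined:.4f}")    # about 0.0054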

We may be talking about different populations of scientists. Not to put on airs, but it is quite possible that psychologists really don't give a damn about statistics, while physicists do. And that may have something to do with the relative importance of accuracy in our specialties.

Regards
BBB

Umm...

Let's just say that mathematical precision is probably more important when you're trying to build a mathematical model than when you're simply trying to describe a phenomenon. Since physicists and psychologists both do both things, there are times that physicists are more concerned with "accuracy," and times that psychologists are.

Dear Dave,

I can let it go at that. This has been an enjoyable exchange -- thanks.

BBB

Ha! I win! Both at reading captions AND knowing what the hell I'm talking about!

All bragging aside, you're wrong on the first one. The answer is yes.

By Jongpil Yun on 07 Apr 2007