Head down to Box Office Mojo and pull up the list of the top-grossing films of the year thus far. Seven of the top ten have a dollar gross beginning with the number 1. Okay, that's not too weird. Big films tend to pull down somewhere between $100 million and $200 million, while only the real monsters gross more. So what if we look at the inflation-adjusted all-time list, which is less likely to be skewed by the coincidental size of the film-going public and ticket prices? Again, seven of the top ten have grosses beginning with 1.

Well, maybe movies are just weird. What about cities? In the US, five of the top ten cities have a population figure which begins with a 1.

Maybe cities are just weird too. How about election results? If you rank the states in the 2008 US presidential election by Obama's vote total, none of the top ten has a total beginning with 1, but then again, every one of the next ten does.

Why this preponderance of numbers that happen to start with 1? Is it just an artifact of the data sets I've picked, or something more interesting? Try a thought experiment:

Pick a number, say, one million. Write it out in decimal notation and it reads 1,000,000. Its first digit is the number 1. If you increase or decrease 1,000,000 by ten percent, you get 1,100,000 or 900,000, which start with 1 and 9 respectively. If you increase or decrease 1,000,000 by twenty percent, you get 1,200,000 or 800,000, which start with 1 and 8 respectively. If you increase or decrease 1,000,000 by thirty percent, you get 1,300,000 or 700,000, which start with 1 and 7 respectively.

Continue this exercise and the pattern holds. The million numbers following 1,000,000 all start with 1, but the million numbers below 1,000,000 can start with just about anything, including 1.
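The thought experiment is easy to run for every ten-percent step. Here's a minimal Python sketch; the helper names `leading_digit` and `scaled_leading_digits` are made up for illustration and don't come from any library.

```python
def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def scaled_leading_digits(base: int, percents):
    """For each percentage, return (pct, digit after increase, digit after decrease)."""
    rows = []
    for pct in percents:
        up = base * (100 + pct) // 100    # e.g. 1,000,000 -> 1,100,000 at 10%
        down = base * (100 - pct) // 100  # e.g. 1,000,000 ->   900,000 at 10%
        rows.append((pct, leading_digit(up), leading_digit(down)))
    return rows


# Step through 10%, 20%, ..., 90% around one million.
for pct, up_digit, down_digit in scaled_leading_digits(1_000_000, range(10, 100, 10)):
    print(f"{pct}%: increase starts with {up_digit}, decrease starts with {down_digit}")
```

Every increase up to 90% leaves the leading digit at 1, while the decreases walk down through 9, 8, 7 and so on.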

Obviously, had you started with (say) 3,000,000 the effect would be much less pronounced, but it would still be there. It's possible to rigorously analyze this sort of thing, and the result is Benford's Law, which gives the probability distribution for the first digits of random numbers:

P(d) = log10(1 + 1/d), for d = 1, 2, ..., 9

Plotting this distribution gives:

From Benford's Law, you'd expect around 30% of leading digits to be the number 1. Not every set of randomly chosen integers satisfies the conditions required for Benford's Law and its odd preponderance of 1s, but lots of them do. In the financial industry, the law has even been used to search for fraud. Humans are generally terrible at making up random numbers that act anything like actual random numbers, and as a result the figures they make up when cooking the books don't tend to satisfy laws like Benford's.
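To put numbers on that, here's a rough Python sketch that tabulates the standard form of the law, P(d) = log10(1 + 1/d), next to the leading-digit frequencies of a sample. The sample is an assumption of mine, not anything from the data sets above: the first thousand powers of 2, picked only because they're easy to generate and are known to track the law closely. The function names are likewise just for illustration.

```python
import math
from collections import Counter


def benford(d: int) -> float:
    """Benford probability that the leading decimal digit is d (1 through 9)."""
    return math.log10(1 + 1 / d)


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def leading_digit_frequencies(values):
    """Fraction of the positive integers in `values` starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustrative sample only (an assumption, not real-world data): 2^1 through 2^1000.
sample = [2 ** n for n in range(1, 1001)]
observed = leading_digit_frequencies(sample)

for d in range(1, 10):
    print(f"{d}: Benford {benford(d):.3f}   sample {observed[d]:.3f}")
```

Any list of positive integers can be passed to `leading_digit_frequencies` in place of the powers of 2. A set of made-up figures that strays far from the Benford column is the sort of thing a fraud check would flag for a closer look, though a mismatch on its own proves nothing.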

Unless you're angling to hang out with Bernie Madoff in Club Fed, you should probably use your math knowledge for good rather than evil. But if you're gonna cook your books, your recipe should probably include about 30% 1s as leading digits...


Benford's law does not work for binary; it gives '1' as the leading digit with frequency 1. The leading digit of zero is '0'.

It works just dandy for binary!

log2(1 + 1/d) gives 1 for a number starting with 1, just as you'd expect.

If one wanted to apply Benford's Law-type reasoning in a binary context, one would presumably use a generalization to leading prefixes of length > 1.
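For what it's worth, the usual generalization says that in base b, the chance that a number's leading digits spell out a prefix p (read as an integer in base b) is log_b(1 + 1/p). Here's a small Python sketch of the binary case; `prefix_probability` is a name made up for this example, and it assumes the prefix begins with a nonzero digit.

```python
import math


def prefix_probability(prefix: str, base: int = 2) -> float:
    """Benford-type probability that a number's leading digits in `base` equal `prefix`.

    Assumes `prefix` is a digit string in the given base that does not start with '0'.
    """
    p = int(prefix, base)  # e.g. '11' in base 2 -> 3
    return math.log(1 + 1 / p, base)


# The single leading bit is always 1, so look at two- and three-bit prefixes instead.
for prefix in ("1", "10", "11", "100", "101", "110", "111"):
    print(f"{prefix}: {prefix_probability(prefix):.3f}")
```

The two-bit prefixes split roughly 58.5% for "10" against 41.5% for "11", so the law still has something to say in binary once you look past the first bit.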

cf. a report from February, 2010, that "The number 4 occurs less frequently than chance would dictate in the tenths of a cent digit for quarterly earnings." Of course, this may be statistically significant for a data set consisting of all companies, but not for any individual company, at which point it's hard to crack down on this sort of thing.