Benford's Law of Amazon Rankings

Late last year, Matthew Beckler was nice enough to make a sales rank tracker for How to Teach Physics to Your Dog. Changes in the Amazon page format made it stop working a while ago, though, and now Amazon reports roughly equivalent data via its AuthorCentral feature, with the added bonus of BookScan sales figures. So I've got a new source for my book sales related cat-vacuuming.

Still, there's this great big data file sitting there with thousands of hourly sales rank numbers, and I thought to myself "I ought to be able to do something else amusing with this..." And then Corky at the Virtuosi did a post about Benford's Law, and I said "Ah-ha!"

Benford's Law, if you're not familiar with it, says that in a large assortment of numbers generated by some process, you expect the first non-zero digits of all the numbers to be distributed in a logarithmic fashion. About 30% of the first digits should be "1," and only about 5% of the first digits should be "9." This goes against the naive expectation that the numbers ought to be evenly distributed, and is actually used by "forensic accountants" to catch people who are cooking their books-- someone who is making up numbers to fill a phony set of books is fairly likely to pick numbers that don't follow a Benford's Law distribution.
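For reference, the usual statement of Benford's Law is that a leading digit d turns up with probability log10(1 + 1/d). Here is a minimal Python sketch of those expected frequencies; the function name is just an illustrative choice, not anything from the tracker script.

```python
import math

def benford_probability(digit):
    """Expected frequency of `digit` (1-9) as a leading digit under Benford's Law."""
    return math.log10(1 + 1 / digit)

# Print the predicted share of each leading digit as a percentage.
for d in range(1, 10):
    print(d, f"{100 * benford_probability(d):.1f}%")
```

The nine probabilities sum to 1, with "1" the most common leading digit and "9" the least.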

So, I've got 6,818 hourly values of the Amazon sales rank for my book, spanning almost three orders of magnitude. How do those digits match up with Benford's Law? Well:

[Figure: benford_rank.png. Observed first-digit frequencies of the sales ranks plotted against the Benford's Law prediction.]

That's... pretty good, really. The blue diamonds are the actual frequency of the digit, the red squares are the prediction of Benford's Law. There's a slight shortage of 1's and a surplus of 5's and 6's, but all the actual frequencies are within about 5% of the expected values. The most basic assumption about the statistics of this sort of data set would lead you to expect an uncertainty of about 1% (that is, 1 over the square root of 6818), but that's pretty crude.
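A comparison like this one can be sketched in a few lines of Python. This is not the script behind the graph. The function names are hypothetical, and `sample_ranks` is a handful of made-up placeholder values standing in for the real hourly data. The standard error uses a simple binomial estimate that treats every reading as independent, which hourly ranks certainly are not, so it is at least as crude as the 1-over-root-N figure.

```python
import math
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

def first_digit_frequencies(values):
    """Fraction of `values` whose leading digit is 1, 2, ..., 9."""
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    return {d: counts[d] / total for d in range(1, 10)}

def compare_to_benford(values):
    """Rows of (digit, observed, expected, binomial standard error)."""
    n = len(values)
    observed = first_digit_frequencies(values)
    rows = []
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        # Standard error if every reading were independent (hourly ranks are not).
        std_err = math.sqrt(expected * (1 - expected) / n)
        rows.append((d, observed[d], expected, std_err))
    return rows

# Placeholder ranks for illustration only; NOT the real tracker data.
sample_ranks = [1523, 987, 12044, 2310, 18450, 764, 3102, 1109, 45210, 1999]
for d, obs, exp, err in compare_to_benford(sample_ranks):
    print(f"{d}: observed {obs:.3f}, Benford {exp:.3f} +/- {err:.3f}")

# The crude 1/sqrt(N) estimate for N = 6818 readings.
print(f"1/sqrt(6818) = {1 / math.sqrt(6818):.4f}")
```

Feeding in the full list of hourly ranks instead of the placeholder list would give the numbers behind a plot of this kind.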

What does this tell us? Not a whole lot, really. If Amazon is somehow fudging their sales rank data (which I have no reason to suspect them of doing), they're clever enough not to get caught by this really crude analysis of one book's figures.

Making this graph has, however, given me a way to put off some tedious and annoying work for another hour or so, so let's hear it for Benford's Law!

Nice illustration of the law!

Did you know that the eponymous Frank Benford after whom the law was named was a research physicist at GE who lived just down the road from Union College (on Rugby Road)?

Benford was inspired to investigate the law empirically after noticing patterns in the dirtiness of the logarithm pages. (This was back in the days when scientists spent a good chunk of time looking up the logs of their data in order to speed up their calculations.) If he (or the prior discoverer of the law, Newcomb) had had access to graphing calculators, who knows if or when anyone would have noticed.

http://www.dspguide.com/ch34/1.htm

Is there any similar law for second and subsequent digits? I can't see why there would be, but then I'm no mathematician, and I would have guessed that the first digits would be random.

By Equisetum (not verified) on 01 Jan 2011

A very interesting post, and it got me thinking...

I have a script for my own book, and a similarly sized dataset, so I decided to run my own numbers. I get a result very similar to your own (1=24%, 2=15%...9=5%), but it's not entirely obvious why. I don't mean that it's not obvious why Benford's law has the form that it does. What I mean is that it's not clear that sales ranks should obey it. The distribution of ranks is most definitely not scale invariant, for example. A certain rate of sales corresponds (more or less) to a certain rank, and at high sales rates, the variance will be relatively small.

And clearly this distribution can't be a general law for a fixed population. Suppose there were exactly a million books. At any given instant, exactly 1/9th have each leading digit (Technically, "1" has 111,112, but that's just quibbling). From a frequentist perspective, Benford's law simply can't drop out.
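The counting argument for a fixed population of exactly a million ranks is easy to check by brute force. A quick Python sketch (the function name is illustrative):

```python
from collections import Counter

def leading_digit_counts(max_rank):
    """Count how many of the ranks 1..max_rank start with each digit."""
    return Counter(int(str(rank)[0]) for rank in range(1, max_rank + 1))

counts = leading_digit_counts(1_000_000)
for d in range(1, 10):
    print(d, counts[d])
```

This gives 111,112 ranks starting with "1" and 111,111 for each of the other digits, so a snapshot of all the ranks at one instant is flat rather than Benford-like.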

Of course, since the number of books isn't 1 million, and isn't fixed, this isn't a perfect line of reasoning, but I must confess, I'm still puzzled why Amazon ranks should follow Benford's law.

The addition of BookScan data to the stuff reported by Amazon will be an interesting check on the correspondence between Amazon sales rank and total sales. It'll be a while yet before I have enough data on that to say anything meaningful, though.

I agree that the full set of sales ranks can't possibly follow Benford's Law, since all the ranks need to be filled (in principle, anyway-- I'm not sure how they handle ties). Any one randomly-chosen book, though, will follow a trajectory through the ranks that is essentially random. That will presumably result in something close to a Benford's Law distribution for that one book's sales history.
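One way to poke at the single-book argument is a toy simulation. The sketch below makes a strong assumption that is not established anywhere in this discussion: that the rank takes random steps in log10(rank), meaning each change is proportional to the current rank, and that it bounces around between ranks of 100 and 1,000,000. The step size, bounds, seed, and function name are all arbitrary illustrative choices.

```python
import math
import random
from collections import Counter

def simulate_rank_history(steps, step_sd=0.2, low=2.0, high=6.0, seed=0):
    """Ranks from a random walk in log10(rank), reflected to stay in [low, high]."""
    rng = random.Random(seed)
    log_rank = (low + high) / 2
    ranks = []
    for _ in range(steps):
        log_rank += rng.gauss(0.0, step_sd)
        # Reflect off the boundaries so the rank stays between 10**low and 10**high.
        if log_rank < low:
            log_rank = 2 * low - log_rank
        elif log_rank > high:
            log_rank = 2 * high - log_rank
        ranks.append(int(10 ** log_rank))
    return ranks

ranks = simulate_rank_history(200_000)
counts = Counter(int(str(r)[0]) for r in ranks)
for d in range(1, 10):
    simulated = counts[d] / len(ranks)
    print(f"{d}: simulated {simulated:.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```

A walk like this spends its time roughly evenly in log space, so it should land close to Benford's Law. That only shows the idea is self-consistent under the proportional-step assumption; whether real sales ranks move that way is exactly the open question.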

At least, that's the justification I came up with when I started thinking about doing this. It may be that a more careful analysis would show a different distribution, and that the deficit of 1's that we see is a real effect, but figuring that out is beyond me. Especially with classes starting tomorrow.

To me it seems obvious that if you have a random sample set, starting at zero (OK that is a bit unlikely, but most sets probably do), the chance of the first significant digit being 1 increases as the first sig fig of the maximum size of the sample set drops toward 1. Once it reaches there, the chance diminishes again until you reach 9, and it starts again.

At no point is the chance of the first sig fig being 1 less than 1/9, so that is the minimum. The maximum would be at, say, 1 to 199, where the chance is roughly (1/9 + 1)/2. This is a range of 11% to 56%. If I simplistically assume an average of these, I get 33%, which is pretty close to what Benford says it is, at 30.1%.
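Those bounds are easy to check by counting. A small Python sketch, assuming the sample is drawn uniformly from the integers 1 to some upper limit (the function name is illustrative):

```python
from fractions import Fraction

def share_starting_with_one(upper):
    """Fraction of the integers 1..upper whose leading digit is 1."""
    hits = sum(1 for n in range(1, upper + 1) if str(n)[0] == "1")
    return Fraction(hits, upper)

for upper in (9, 19, 99, 199, 999, 1999):
    share = share_starting_with_one(upper)
    print(f"1..{upper}: {share} = {float(share):.3f}")
```

The share is exactly 1/9 when the upper limit is 9, 99, or 999, and it peaks when the limit is 19, 199, or 1999, at 11/19, 111/199, and 1111/1999 respectively, all a little above 55%.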

I see comments (elsewhere) that state: "Everyone knows that our number system uses the digits 1 through 9 and that the odds of randomly obtaining any one of them as the first significant digit in a number is 1/9."
And that appears immediately false to me.

Benford's law seemed obvious to me as a perfectly natural thing to occur after about 10 seconds thinking about it. Am I missing something?

@Jerome, thanks for convincing me I'm perhaps not crazy. I've been having trouble seeing why this isn't an obvious consequence of the fact that a "random" number occurring in the real world is drawn from a finite (i.e., less than so-and-so) set of numbers.