How Many Books Is That?: Modeling Amazon Sales Rank

A few months ago-- just before the paperback release of How to Teach Physics to Your Dog-- Amazon started providing not only their Sales Rank data, but also sales data from Nielsen BookScan. Of course, the BookScan data are very limited, giving you only four weeks, and the Sales Rank data, while available over the full published life of any given book, are presented only as a graph, with no way to extract them as a data table. You'd have to be some sort of obsessive nerd to make a quantitative comparison between them.

So, anyway, here's the data I got for How to Teach Physics to Your Dog:


This is a graph of BookScan sales data for each week vs. the average Amazon Sales Rank for that week. Blue points are the hardcover for How to Teach Physics to Your Dog, brick-red the paperback. The one lonely point in the upper left is the sales figure for the first week of the hardcover sales, which a little bird provided for me at some point (I don't have BookScan access outside of the Amazon service).

This is a log-log plot, so those straight lines fit to the data are power laws. According to this, to estimate the number of copies the hardcover has sold based on its Sales Rank, you would use the formula:

N(sold) = 35,900,000 × R^(-1.34)

Where R is the Sales Rank. This is obviously a little dodgy, given that there's only that one little point floating way up on the left, and to do a better job, I'd need to have lots more points with high ranks and large numbers of copies sold. If you want to go out and buy several hundred paperbacks of How to Teach Physics to Your Dog from Amazon over the next couple of weeks, I'd be thrilled to have the data...
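
For anyone who wants to play along at home, here is a minimal sketch of that formula as a Python function. Only the two fit constants come from the graph; the function name and the sample ranks are made up for illustration.

```python
# Fit constants quoted above for the hardcover edition.
COEFFICIENT = 35_900_000
EXPONENT = -1.34


def estimate_weekly_sales(rank):
    """Estimate copies sold in a week from that week's average Sales Rank."""
    if rank <= 0:
        raise ValueError("Sales Rank must be positive")
    return COEFFICIENT * rank ** EXPONENT


if __name__ == "__main__":
    # Hypothetical ranks, chosen only to show how fast the estimate falls off.
    for rank in (1_000, 10_000, 100_000):
        copies = estimate_weekly_sales(rank)
        print(f"rank {rank:>7,}: about {copies:,.0f} copies per week")
```

Because the exponent is -1.34, a rank that is ten times worse corresponds to roughly 22 times fewer copies (10^1.34), which is why the one point at the top of the graph matters so much.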

So, anyway, what good is this? Well, I know exactly how many paperbacks have been sold, thanks to Amazon's BookScan numbers, but I don't have the numbers for the hardcover. I do, however, have reams of data from the nifty Sales Rank tracker Matthew Beckler wrote for me (which stopped working a while ago, but has been supplanted by the Amazon features, anyway).

So, using that handy formula from the graph above, and the several months of sales ranks from the tracker, what can I say about the number of books sold? Well, after laboriously converting the ranks to approximately the same format as the data from Amazon, I can plug in the average sales rank for the hardcover for every week it was out, and use that to estimate the number of copies sold. The resulting data look more or less like this:
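
A rough sketch of that procedure in Python: group the tracker's rank samples by week, average each week, convert each weekly average to a sales estimate with the fit formula, and add them up. The (week, rank) input format, the use of a plain arithmetic mean, and the sample numbers are all assumptions for illustration, not the actual tracker output.

```python
from statistics import mean


def weekly_average_ranks(samples):
    """Group (week_index, rank) samples and return {week: mean rank}."""
    by_week = {}
    for week, rank in samples:
        by_week.setdefault(week, []).append(rank)
    return {week: mean(ranks) for week, ranks in sorted(by_week.items())}


def estimate_total_sales(weekly_ranks, coefficient=35_900_000, exponent=-1.34):
    """Sum the per-week estimates N = coefficient * R**exponent."""
    return sum(coefficient * rank ** exponent for rank in weekly_ranks.values())


if __name__ == "__main__":
    # Hypothetical samples: (week number, Sales Rank at sample time).
    samples = [(0, 900), (0, 1_500), (1, 4_000), (1, 6_000), (2, 20_000)]
    weekly = weekly_average_ranks(samples)
    for week, rank in weekly.items():
        print(f"week {week}: average rank {rank:,.0f}")
    print(f"estimated total: {estimate_total_sales(weekly):,.0f} copies")
```

Note that averaging the ranks first and then applying the formula is not the same as applying the formula to every sample and averaging the results, since the power law is strongly nonlinear. The sketch follows the first approach because that is what matches the weekly format of the Amazon data.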


This gives you a rough idea of the number of copies a moderately successful pop-science book sells in the US. What's the total sales figure? That's a little tough to say, because the fit to the first graph is still fairly sensitive to exactly what data are included--the two fit parameters wander around a little, and because it's a power law spanning a couple of orders of magnitude, small changes in those values can lead to substantial changes in the estimated total.

With that caveat, the estimated value looks to be between 4000 and 5000, probably toward the high end of that range. Now, comparing the fit function result with the data I have from Amazon suggests that this estimate is low by about 10% (this works for both the hardcover and the paperback numbers, and for several different subsets of those data). Using 4800 as the estimated number, that extra 10% gets you to 5280 books (one for every foot of a one-mile stretch...). BookScan estimates that it covers about 75% of all sales, so that would put the total at right around 7000.
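
The arithmetic in that paragraph, spelled out as a short Python sketch. The three input numbers are the ones quoted above; the function and variable names are just labels chosen here.

```python
def scale_estimate(fit_total, fit_shortfall, bookscan_coverage):
    """Correct the fit total for its shortfall, then for BookScan's coverage."""
    # The fit runs low by `fit_shortfall`, so scale the fit total up.
    bookscan_equivalent = fit_total * (1 + fit_shortfall)
    # BookScan sees only a fraction of all sales, so divide by that fraction.
    all_sales = bookscan_equivalent / bookscan_coverage
    return bookscan_equivalent, all_sales


if __name__ == "__main__":
    bookscan_equivalent, all_sales = scale_estimate(4800, 0.10, 0.75)
    print(f"BookScan-equivalent copies: {bookscan_equivalent:.0f}")
    print(f"Estimated total copies:     {all_sales:.0f}")
```

That works out to 5280 and then 7040, which is the "right around 7000" figure.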

How good an estimate is that? I can't really say, since I don't have BookScan numbers for the hardcover, and my royalty statements only give the number of copies shipped out to stores, not the copies actually sold to consumers. It's in the right ballpark, at least-- it's less than the total number printed and shipped, for example-- but beyond that I don't have any useful information.

This was an amusing way to spend a few hours crunching numbers, though.

(Interestingly, the UK edition has shipped more than twice the number of copies the US edition did, and has been on one British chain's bestseller list for a while now. Which just goes to show you the strong random element involved in the publishing business...)

The biggest weakness of this model is really the lack of data at high ranks and high sales-- there's just the one point fixing the top end of the power law, so adding new data points can swing the total value from the fit by a few hundred copies one way or the other. If I had more data in that region, the fit would probably be more stable, and the estimate better.
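
To see why one lonely point has so much leverage, here is a sketch of a power-law fit done as an ordinary least-squares line through the log-log data. This is an assumed fitting method, not necessarily the one used for the graph above, and the data points in the demo are invented purely to show the effect.

```python
import math


def fit_power_law(ranks, sales):
    """Least-squares fit of sales = A * rank**b on log-log axes; returns (A, b)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(s) for s in sales]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the line = power-law exponent
    a = 10 ** (y_bar - b * x_bar)      # intercept gives the prefactor A
    return a, b


if __name__ == "__main__":
    # Invented data: a cluster of ordinary weeks plus one big launch week.
    ranks = [300, 8_000, 12_000, 20_000, 35_000, 50_000]
    sales = [1_500, 210, 130, 60, 30, 18]

    a_all, b_all = fit_power_law(ranks, sales)
    a_cut, b_cut = fit_power_law(ranks[1:], sales[1:])  # drop the launch week
    print(f"with launch week:    A = {a_all:.3g}, exponent = {b_all:.2f}")
    print(f"without launch week: A = {a_cut:.3g}, exponent = {b_cut:.2f}")
```

With only one point far from the cluster, that point largely sets the slope, so including or excluding it (or nudging it) shifts both fit parameters and therefore the summed total.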

So, again, if anybody would like to buy several hundred copies of the paperback edition from Amazon over the next several weeks (you could hand them out on the subway...), to boost the sales ranking and get me some better data, feel free. Of course, to keep the BookScan ratio about right, you'll also want to buy several hundred from your local big-box chains... But it's for SCIENCE!, so it's all good...


"(Interestingly, the UK edition has shipped more than twice the number of copies the US edition did, and has been on one British chain's bestseller list for a while now. Which just goes to show you the strong random element involved in the publishing business...)"

I'll take issue with that. The Brits are very much more interested in physics than us Yanks. They have Newton and Hawking, and those few who know Rutherford and Dirac, and they're about 5 times more civilized than us Americans. The European mindset is that education is good in its own right, the American being it's what you have to do to get a diploma or a degree, dangit.

We all like dogs, but the science is the thing. Don't be surprised if your next book about relativity and Einstein does very well in Germany. Just a prognostication, we shall see.

Well, I bought your book the week it came out, from Amazon, and based on Woit's review, so I see my data point. ;-)

Hey there,

I just finished my qualifying exams this past week, so I suddenly have ample spare time on weekends to update sales-rank trackers for physics books involving nice dogs. I upgraded the page-scraping script to be Python instead of bash, so I can use the nice lxml XML-parsing library. There isn't much new data yet, but I've restarted the cron task and it should be gathering data every hour, starting now. Enjoy!

Matthew Beckler