Technorati Tags: scale, computation, information
Since people know I work for Google, I get lots of mail from folks with odd questions, or with complaints about some Google policy, or questions about the way that Google does some particular thing. Obviously, I can't answer questions about Google. And even if I could, I wouldn't. This isn't a Google blog; this is my blog, which I write as a hobby in my free time.
 But there are intersections between my work life and my hobby. And one of the ideas that underlies many of the questions that I receive, and which also
hits on my work, and my hobby. And that's the idea of scale. Scale is computer-science talk for how things change as they get bigger. In particular, I'm talking about the scale of information; the amount of information that we use on a daily basis has increased dramatically, and the amount of dramatic, fundamental change that has resulted is both amazing, and amazingly unnoticed by most people.
What's scale?
Take a bunch of information, and think about how you would analyze it. Suppose you have a list of, say, music tracks. On my computer, I've got 4,426 tracks of music. Now, suppose I wanted to go through those tracks, and summarize what's there. If all you want to look at is simple information - the name of the track, the artist, the length, the genre, etc., it's trivial. You could do it in seconds on any modern computer. Heck, you could do it in seconds on any modern cellphone.
 Now, suppose that instead of 4,000 tracks, it were 4 million. That's still
not a big deal. I can still do all sorts of interesting analysis of 4 million
records in minutes. With a modern computer, it's no big deal.
But what if I wanted to do something more interesting, like look at the files and try to compute the tempo? Now instead of dealing with four million sets of strings, I'd need to actually look at the full content of the songs, to compute the tempo. OnIn my mumic collection, the average track seems to take up a couple of megabytes; let's say it's just one megabyte per song. To analyze a library of 4,000 songs at one megabyte each is doable on a modern PC, but it's going to take a while. To do it to a library of 4 million tracks is simply not feasible. I can buy a disk for my computer large enough to store that many tracks, but there's no way that I'd ever be able to run an interesting sonic analysis and finish it.
 Services with music databases like LastFM, Pandora, and iTunes maintain
information on tens of millions of tracks, and many of us play with those
every day without ever thinking about what's going on. LastFM and Pandora
analyze millions of tracks of music, so that when you tell it what you like, it
can suggest similar things based on the musical structure. Ten years ago, having a computer do that would have been impossible. Today it's routine.
 That's scale - the way that things change as you increase the size of the problem.
Things get bigger and bigger, and eventually the size of the problem crosses a line so that the increase in size fundamentally changes the problem.
 Again, let me try to put the idea of quantities of data into some kind of terms
that intuitively give you an idea. I bought my first computer in 1982. It was a
Franklin Ace 1000 - and Apple II clone. It had a floppy disk drive, which stored
140,000 bytes on a 5 1/4 inch floppy disk. My next computer was a big step up - an
Amiga, which stored around 800,000 bytes on a 3 1/2 inche disk drive. Today, my
cellphone has a removable flash drive which can store 8,000,000,000 bytes, on
a little tiny card about 1/4 inch on a side, and about the thickness of a greeting
card. If I were to convert the storage on my cellphone to floppy disks on my old Apple
clone, it would require about 57,000 disks. That would be a stack of floppy disks
almost 500 feet high. 
 We're routinely carrying around quantities of data that we simply don't have
the ability to intuitively understand.  A gigabyte is a billion letters. What does that mean?  We just don't think about it. The closest we come is
to break it down some other way; instead of calling it a billion characters, we'll call it a hundred songs. But the availability of that quantity of data has dramatically changed the way we work, and the things we can do. I used to
have a little briefcase full of cassettes for my car stereo. I carried this stupid thing around, with 100 little plastic cassettes in it, just to have a decent selection of music to listen to. Now I've got 887 tracks on my Android, and I complain about how I don't have enough space for a decent selection! That's scale - the
amount of data I can afford to carry with me has changed dramatically, and as a result, my expectations about what's normal have changed.
 The effects of scale are very counter-intuitive. By making the problem bigger,
in some cases, you make it easier. There are things that you can do
easily with data at massive scale that you couldn't do at all with
small quantities of data; and there are things that you can easily do with small amounts of data that become impossible when you scale upwards.
 We'll start with the upside. Scale means that you can do some amazing things
that used to be impossible. For example, scale means that you can record the entire genomes of large numbers of organisms. Last week, my fellow ScienceBlogger Tara Smith at Aetiology wrote an article about a recent Ebola outbreak; doctors studying the outbreak wanted to see what strain of Ebola they were dealing with, so they sequenced it, and compared it to the existing library of Ebola genomes, and identified it as a new strain, but related to one of the known strains. Scientists now now routinely sequence the genes new infectious diseases, and compare them with a range of known variants. In fact, they do more than that; to quote one website
discussing this:
"Due to the explosive growth of the field of Genomics, several
databases containing fully sequenced genomes of many viruses have been created and
placed online for easy access. By taking the newly sequenced Ebola genome data,
scientists were able to compare its composition to that of other viruses. When the
translated genome of the Ebola virus and its respective subtypes was compared to that
of other viruses, something of particular interest stood out. There was an almost
identical match between the immunosuppressive motifs of the oncogenic retroviruses,
murine leukemia virus (MuLV) and feline leukemia virus (FeLV) and a region of the
Ebola genome."
To translate that into simpler terms: they didn't just compare the genome of the new Ebola variant to other Ebola viruses; that wouldn't be particularly difficult, since the Ebola genome is only about 19K of data. But they compared it to the sequenced genomes of every previously sequenced virus, and found nearly identical sequences in Ebola and feline leukemia virus!
 That's the good side of scale. Things that we wouldn't have imagined just a few years ago are not just possible, but they're downright routine - because carrying
around and manipulating massive quantities of data is now routine. Scientists now casually work with massive quantities of information, and by doing that, they're finding completely new ways of doing science. Ten years ago, if they could have gotten state of the art equipment, the scientists looking at the Ebola genome could probably have sequenced it. But they would never have dreamt of comparing it to feline leukemia virus. But with the scale at which we can now work, it's harder to exclude some probably irrelevant virus than it is to go ahead and compare it, just on the incredible off-chance that there might be something there. 
 In some ways, it's almost like something out of Douglas Adams' "Hitchhiker's Guide
to the Galaxy" made real: we can now generate answers to questions that we didn't even
ask, and by there are answers laying in the data we have available for questions that
we don't even know. By searching for apparently meaningful patterns in large
quantities of data, we can find answers before we even know what the questions are.
(Of course, just like in HHGtG, we still need to figure out what the question is.)
Scale has dramatically changed science: we can routinely look for things and discover
correlations that we don't expect, which opens up whole new questions that we never
thought of before.
 That's the happy side of scale: the ability to mine huge quantities of data means
that we can look at things and discover things in a whole new way. But there's also
a downside: there are things that would be  easy for small-scale data that becomes unmanageable on a large scale.
 In computer science, we've traditionally drawn a line between things that are really computable, and things that while theoretically computable, simply take too long on any real computer to be doable. We've drawn that line in terms of something called algorithmic complexity. The line has traditionally been that anything whose
complexity can be bounded by some polynomial computed from the size of its input is computable, while anything whose complexity is bounded by some exponential in the size of its input is not. But scale changes that. When you're working with large
scale data, polynomial complexity is not really doable. Even if you've got something
which grows very slowly, like sorting, becomes close to impossible at scale.
 A great example of how even people who deal with massive quantities of data
haven't really grasped what a deep fundamental change we're seeing in the scale of
things comes from the database world. A while ago, a database magazine published an
article critiquing Google's MapReduce for not being sufficiently
relational-database-like. I wrote 
href="http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php">a
post responding to it, explaining why that's wrong. I still get mail about that
old post from database folks who think that the stuff that's typically done by
MapReduce should be done by relational databases.
 Databases were designed to deal with what were (at the time) considered massive
quantities of data. And as database people keep reminding me, modern databases support clustering - meaning using multiple computers in parallel to make things faster - just like Google, so that they're more scalable than I give them credit for.
 BUt there's massive, and then there's massive. Databases are great
at handling large quantities of data. But there are collections of data that are simply too large for current database technology. And much of the reason
for that isn't just because databases haven't been designed to work with network-scale datasets, but because many of the fundamental database algorithms and representations simply don't work on the massive scale.
 It's easiest to explain this through an example. This past weekend, 
href="http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html">a team
at Google (a team which I have no connection to; I know no one involved in
the task) announced the results of trying to sort one petabyte of data,
stored in 10 trillion 100-byte records. Doing this required 4,000
computers and 48,000 hard drives working together for just over six hours.
 When database people talk about clustering, they're usually talking about up to a
dozen or so machines. (I just did a quick search of IBM's DB information on the web,
and couldn't find any reference to using more than 12 CPUs.)
 There's a big difference between dealing with 12 CPUs and gigabytes of data, and
4,000 CPUs and a petabyte of data stored on 48,000 disk drives! Modern large scale
systems are increasingly more like the petabyte scale of large map-reduce than the
gigabyte scale of databases. According to the post about the petabyte sort results,
Google routinely processes 20 petabytes of data every single day. I'm sure
that other large internet companies are seeing similar quantities of data. Both Yahoo and Microsoft are certainly maintaining search indices that have to be in that
rough size range; ISPs need to log traffic information for millions of users;
cellphone companies must be recieving billions of data packets per hour providing them
with location information about cellphones; and so on. And it's not just
internet companies: maintaining genomic data on thousands of human beings (which is
something medical researchers want to do); or maintaining infection disease tracking
information for large populations; or even managing auctions and bids on eBay all involve storing 
 That's scale, and that's downright scary. Just sorting one
20th of the data processed by Google every day takes a team of brilliant engineers
to figure out how to do it, and the result requires tens of thousands of computers and hundreds of thousands of disks!
 Imagine how many computers it must take to encrypt all of that information! Companies like Google, Yahoo, Amazon, and Ebay are recieving millions
to billions of requests per day, and they're not storing the information about those requests in text files that any bozo can read. Just think about what it means to
encrypt 20 petabytes of data per day.
 There's another important downside to scale. When we look at large quantities of information, what we're really doing is searching for patterns. And being the kind of creatures that we are, and given the nature of the laws of probability, we are going to find patterns. Distinguishing between a real legitimate
pattern, and something random that just happens to look like a pattern can be somewhere between difficult and impossible. Using things like Bayesian methods
to screen out the false positives can help, but scale means that scientists need to learn new methods - both the new ways of doing things that they couldn't do before, and the new ways of recognizing when they've screwed up.
 There's the nature of scale. Tasks that were once simple have become hard
or even impossible, because they can't be done at scale. Tasks that were once
impossible have become easy because scale makes them possible. Scale
changes everything.
 
Thank you for this very interesting glimpse into the world of big data. The structure of modern commercial database systems reflects a time when memory, disk space and CPU cycles were all scarce resources. I have often wondered how they might evolve in today's environment - would statistical methods and tools replace boolean ones? Uncle Joe's quote about quantity having a quality all its own applies here.
"Feline leukemia!"
"We're going to get lynched, aren't we?"
Very nice post, there's some very cool examples of scaling in there.
It might also interest people learning about scale to check out Clay Shirky's, Here Comes Every Body, which covers a lot of the social implications of scaling in Internet technology and "social software", which also demonstrate huge differences in simply what is possible, and what is no longer not possible. "More is different"
-- Lorenz
Neither LastFM nor Pandora actually examine the contents of songs. LastFM uses listening histories (collected via the sites flash player and plugins to various players) and Pandora uses attributes determined by trained humans ("the music genome project").
MusicBrainz does analyze the content of songs to determine if they're the same. The analysis has changed, but the idea is to compute an identifier (vaguely similar to an MD5 hash) that doesn't change simply because one file is a 64Kb/s Mp3 and the other is 128Kb/s Ogg. The analysis is performed by an individuals computers, not the service.
My litmus test for people who claim unlimited scalability is: does that apply when you have a Shannon?
As I defined in "Quintillabit: Parameters Of A Hyperlarge Database", Proceedings, Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada. IEEE-CS, 1980, IEEE Catalog Number 80CH1534-7C, Library of Congress Number 79-93333 :
1 Shannon = 1 mole of bits = 6.02 x 10^23 bits.
Huh? 12 CPU for regular DB? Four years ago when I last was involved a medium large TPM system with ERP used 128 CPU's and half a terabyte of memory in two boxes. Thousands of really concurrent updates. Oracle DB. This was not huge. You are way underestimating the scale of modern relational databases.
On the other hand to try to shoehorn in map reduce as some kind of relational operation is silly.
Great blog! I'm a biologist myself (or at least in training), but I have quite a few computer scientist friends and hence hear quite a bit about dilemmas like these, albeit not phrased quite so well.
You mention the new ebola variant that was discovered, and how it was BLASTed to discover that it shares some sequences with FLV. That pleased me...I think that biologists, particularly in my branch of biology (systems biology) could benefit hugely from the lessons learned by computer scientists in regards to data handling.
We are in desperate need of database specialists to build us new BLAST algorithms. Genomic DNA is a weird thing; it's information content is stored on multiple levels (for instance, while the rote genetic code obviously encrypts data, data is also encrypted in the number of times a sequence repeats, or the density of certain sequences in a given area), which the BLAST algorithms are comparatively poor at searching for. And understandably so, their job is to just look for similar sequences--but what if we want to search for, say, two sequences that are within some differential tolerance? Or, within redundancy for amino acid translation? (DNA is quite redundant, any given sequence will have a number of equivalent sequences that yield the same result...at least sort of...). However, i'm thinking that these tasks are just complex enough that, while only a little different from searching for sequence homology, prove enormously more processor-intensive when you're trying to compare a terabyte or so of sequence data.
I translated your article here:
http://habrahabr.ru/blogs/i_am_clever/45472/
Hope you don't mind :)
Excellent article and well done, but one thing I might point out is that modern relational databases routinely and easily handle databases of several terabytes, not gigabytes, without problem. And that of course is within a single database, many companies approach large datasets by logically breaking them into multiple databases, when that is both possible and appropriate.
cellphone companies must be recieving billions of data packets per hour providing them with location information about cellphones
Makes one wonder what the NSA's warrantless wiretapping program could actually succeed in turning up (particularly with a dearth of Arabic-speakers available to translate), assuming foes with the minor smarts (1) not to call suspect locations, and (2) to code verbal communications so as not to say anything overtly interesting.
I'll skip the obvious fact that most bookstore managers don't put much thought into their layout. There was a period from about 1940 to about 1975 or so when most publishers and bookstores thought fantasy written for adults wouldn't sell. So a great deal of fantasy was published in the SF category. This period came to an end with the success of publishers such as Del Rey and Bantam, who proved you could sell fantasy to adults without pretending it was science fiction. (Also, note that for about 4 decades the largest American professional organization of writers in either genre has been the SFWA - Science Fiction and Fantasy Writers of America.) Once publishers and bookstores had been convinced fantasy could be sold to adults, they needed a place to put it - and why not stick it with the SF, where they had been putting it for years (albeit under a different name)? At that time, during the late 1970s and 1980s, SF and fantasy shared many of the same writers and publishers.
Fantasy is where anything goes, and SF is where almost anything goes, provided appropriate technbabble is inserted. Totally different, forsooth. By any reasonable estimation, SF is a subset of fantasy.
As to writers that do both, In addition to the many named above, I recall Ray Bradbury, Harry Turtledove, Jack L. Chalker, Marion Zimmer Bradley, Roger Zelazny, Piers Anthony, David Drake (but his fantasy was beyond awful), Janet Morris, Leigh Brackett, Lin Carter, Edgar Rice Burroughs, Julian May, Andre Norton, and Orson Scott Card (yeah I hate his politics too). On top of that there are a number primarily SF writers who only wrote a handful of unambiguous fantasy short stories - such as Isaac Asimov, Larry Niven, Harry Harrison, and David Brin.
40% of Americans are high school dropouts. An uncomfortable fraction more are functionlly illiterate at graduation. Head Start is punctiliously documented 40 years and some $100 billion of futility. How shall more than half the population relate to a society functioning wholly beyond their comprehension? They have the same vote you do at the polls - and there are more of them.
Having started out on a 100 KHz computer with 25,000 bytes of main memory (IBM 7070), I now find that all my instincts are wrong. I just can't handle the idea of sorting a petabyte with 4,000 multiple-gigahertz processors (please tell me it was actually 4,096; I can still do powers of two). How can I design a new program, when my brain is telling me to worry about saving a hundred instruction executions here and a thousand bytes of storage there?
Fortunately, I'm retired, and won't be inflicting the world with a slightly advanced version of two-digit year numbers to save those extra bits of storage. I'm afraid that the world of computing will have to become a spectator sport for me, much as physics and astronomy did years ago. Ah, well...
From an earlier post on my blog, a related quote: "In 2002, I had DSL coverage at 768Kbps and my hard drive at home was 20GB. In 2007, I have 1.5Mbps, and my hard drive is 250GB. My data rate has doubled in five years while my data storage has gone up 12.5 times. Already I rely heavily on sneakernet and various data storage media to move bits around. In another five years, we'll be relying more on bike messengers and FedEx to move our data around *physically* overnight, because the staggering amounts of data people acquire simply won't be able to fit through an electronic transfer."
I won't even talk about acquiring my first computer that actually had a hard drive!
Google, MSN, and Yahoo have something many companies don't have - bandwidth capacity on their internal network. Analyzing petabytes of data isn't just a node-capacity problem, it's a network throughput problem. Getting 400 TB of data from geographical location A to geographical location B is not a trivial problem.
Scaling problems get bigger as systems get more complex because of bottlenecks. The bigger the system, the more bottlenecks you have to deal with, and the wider base of knowledge is required to know where the problem *actually is*.
Now I admit that I have never worked with petabytes of data. However, in my experience, the false patterns can actually be very reliably filtered out by 2 stage process (that people often forego in favor or fancier/complex algorithms), even though the fact that they have to look at all the data, lends it to this simple method.
Perform a study on a statistically significant sample, than perform the study on the entire set of you data. If the pattern is true, your algorithm's confidence (pvalue or whatever metric you are using) must become stronger. If you got the power to run through petabytes, you shouldn't warry about running a pre-pass on ~10% of it (or more as the case may be). It won't remove all the false patterns, but it tends to reduce the number enough that even a human can sift through the rest.
Again, this only works if you are going to look though the entire set and don't just have access a small sample. (which is the case in the examples Mark provided)
Bob Munck at 13: How do you think they managed to fit all that stuff on just 4000 processors? Saving one clock cycle on an inner loop when you have to do that loop 460 trillion (1E13 * lg(1E13)) times shaves off 30 processor hours of work on 4GHz processors. If you can do that, then your skills are still useful.
Though I'd avoid the 2-digit years nowadays. :)
You used a < instead of a <, didn't you?
Hope you don't mind if I talk for a minute about what I thought this post was going to be about. I've been thinking that failure to understand scale is a fundamental problem for intelligent design theorists. They keep talking about how gosh darn complicated everything is, how the flagellum is so damn small. It's only small compared to us.
And as long as I have someone's attention: am I right to think that "statistically impossible" is basically nonsense? That probability can approach zero but impossibility is zero and is a fundamentally different situation?
Hope I'm not talking gibberish from your perspective. Go ahead, ask me something about criminal law.
Fun post, Mark. "Next-generation" DNA sequencing is so fast/powerful, that we literally don't know what to do with all the information.
Two anecdotes: First, a seminar speaker mentioned that it's currently easiest to store the sequencing information *as DNA* - i.e., their computers are not fast/big enough to store all the sequencing data, so it's easier to re-sequence than it is to store all the previous runs.
Second, an interviewer offered me a project in which >10 million tissue-specific, (developmental) stage-specific, SAGE tag reads (which correspond mostly to mRNA) need to be examined. The goal: to find questions, exactly as Mark wrote. I chose another lab :)
Re #4:
In fact, Last.fm has been analyzing songs (producing format-insensitive digests, like musicbrainz stores) as they're played for some time now, though AFAIK they still aren't using that data for anything. I believe the eventual purpose is to rectify inaccurate metadata, but I haven't heard anything from the developers for a while.
#18: I've a similar argument made by Dawkins, and others. The sheer scale of some systems, be it in time, space, or frequency, makes our intuitions about probablity almost entirely worthless.
Thus, what may be supremely unlike to happen in our lifetime may be almost certain to happen in the life of the Earth, and what may be practically impossible to witness on Earth may happen every second somewhere in the Universe.
Similarly, several molecules coming together to form the first replicator may be laughably unlikely, but how many quadrillions of molecules were attempting just that?
Their rhetoric is useless*, because the intuition we use to understand it is flawed.
Useless to establish reasonable doubt, if not useless to sucker the ignorant.
Intersting post! Have you thought about using manifold theory to identify your data subsets?
http://en.wikipedia.org/wiki/Manifold
A quick example from space-craft/missle; the guidance of the missile is on one manifold I,and the control around the cg is on another manifold II.
For all practical design purposes, the two manifolds can be treated as independent systems.
The support vector concept is an old way to identify these hidden manifolds.
In my previously mentioned 1979 conference paper "Quintillabit: Parameters of a Hyperlarge Database" which someone has kindly scanned in as a PDF easily found, from the IEEE proceedings of 1980. My quantitative predictions of 1999-2000 A.D. (made, as I say, 20 years earlier) include:
* 6 x 10^9 people
* 1 x 10^9 of them using a global data network
* a megabit for a buck at home
* 10^18 bits of personal data in the network
* personal long-distance "wrist radio" for 3 cents/minute
* music library for 30 cents/hour
* Computer/Communications industry at 12% of Gross World Priduct
I was also wrong:
* CB radio by satellite (Daddy, what's CB?)
* video conferencing for $100/hour (cheaper, more widespread, lower res)
and vague:
* computer links and information service at 10 cents to $10/hour
* terabit chip (well, a bunch of chips parallel in one little box)
Anyone else have a comment on that ancient Moore's Law set of predictions?
Or how things scale up in our future, as we approach a Shannon = 6.02 x 10^23 bits?
@23 (Jonathan): 10^18 bits of personal data on the network? I am considered an information hog by my friends, and have about 10^13 bits across all my devices and storage units. I'm a bit skeptical of 10^5 times that much data about me floating around the 'net. I could believe an equivalent amount, or even 10^2 times the amount, but I think that we're off of an exabit of information by about three orders of magnitude.
Excellent post. You make difficult subjects easier to grasp. It is a real talent. I program in an IBM db2 environment with a database with several billion rows. We partition the database which runs on a mainframe workhorse. I am interested in knowing what sort of patterns Google looks for in its petabytes of data. It is probably related to your advertising business. Why spend on that resources otherwise. Like what makes an advertisement effective, or how to reach buyers most efficiently.
tigerhawkvok: Of course, you may be right. "[I] have about 10^13 bits across all my devices and storage units...."
Not just the data that you have on your personal devices, because that's the tip of the iceberg.
Yes, and how much do all the stores where you've ever bought anything by credit card have on you? And in the databases of what government agencies? Including all your phone conversations in which keyword searches were auotmatically done, and surveillance videos stored on local memories or encrypted to servers so that web-based spot-checking can be done by off-site people?
Why, for instance, does WalMart have many tens of Petabytes of commercial data, and has tested years ago the ability to auto-generate customized snailmails to customers saying: "Last Holiday Shopping Season you bought this DVD xxxx, and this game disk yyy, and this pair of pants zzzz, plus this smoked sausage basket wwww. Based on your preferences, and those of people with your style of good taste, we recommend these gift purchases, ttt, uuu, vvvv. Call this 800 number, and well have those ready to pick up at the Walmart closest to your home address aaaa, or work address bbbbb."
In Sweden, there are not allowed to be any secret adatbases on personal information, in corporations or government agencies. In the USA, there is most resoundingly no such law.
Those folks with discount cards from supermarkets -- how long do you think it will be before your medical insurance is discontinued, because the analysis of how many fatty snacks you purchase?
Great post. Scale does make a difference. I remember all sorts of fretting about database management for an airline keeping track of all its flights and passengers, but then I did a few quick estimates. Even 15 years ago everything fit into memory with room to spare.
Problems often change nature as our computer resources resize. I remember when the Navier Stokes equation was driving supercomputing in some quarters and there were surprisingly accurate predictions of when it would be practical to do a half a wing, a complete aircraft, a propeller, and then a helicopter.