Last week NIST anounced the winner of its Cryptographic Hash Function Competition. After five years of review and many rounds of discussion and elimination, the winner is a hash function called Keccak, and its developers deserve many congratulations. It's a shame hash functions aren't better known in the general public, because not only are they a vital part of keeping data safe online, they're one of the most interesting bits of applied math. Better still, their basic concept is not complicated at all.
Hash functions are so cool, in fact, that I want to spend several posts this week discussing them. Today we'll just talk about what they are, and why they might be useful.
A hash function is just a mathematical function like any other. You put in a number, and the function spits out another number. In high school you dealt with functions like $latex f(x) = x^2$. Put in 2, it spits out 4. Put in 3, it spits out 9. A hash function is less predictable. It's designed to be less predictable. In fact, a working definition of "hash function" is a function whose output is a number with no obvious relation to the input.
Computers internally represent text as numbers, so the first letter of my name - "M" - could be represented by its numeric ASCII code - 77 - and fed into a function. Here we won't worry about the details of the text-to-number conversion. We'll just say that it's easy for a computer do so we'll just write the input of the function as text (It could also be an image, movie, or any other kind of digital data.). One hash function is called CRC-32, and if we denote it c(x) some examples of its output are:
c("Matt Springer") = 2690866847
c("ScienceBlogs") = 385760650
c("Science Blogs") = 531647807
Even very similar inputs can result in very different outputs, as expected. It's also important to notice that CRC-32 is a 32 bit hash, which just means every output is a number between 0 and 2^32 (which is 4,294,967,296). Of course there's more than four billion possible pieces of data in the world, and therefore some identical inputs which will correspond to the same outputs. While two random inputs will only have a 1 in 2^32 chance of producing identical hashes, the birthday paradox means that if you have a bunch of random inputs, odds are you only need on average 2^(32/2) = 65,536 inputs before you end up with a duplicate hash. That's not so many. There are even single words in the dictionary that have the same CRC-32 hash. Both "plumless" and "buckeroo" hash to 1306201125, for instance.
If you really really need to avoid duplicates, there no alternative but to use a hash function with a longer output. 128, 256, and 512 bits are common sizes. With a 256 bit hash, there's 2^256 possible outputs (more than 10^77) and odds are you'd need about 2^(256/2) random messages (about 10^38) before you end up with a duplicate hash. Thee are astronomical numbers, and unless something is mathematically wrong with the particular hash function in question we will never live to see see two different inputs produce the same 256-bit hash. One 256 bit hash is called SHA-512, and my name hashed using that function is
SHA-256("Matt Springer") = 20005913487026535327234686971684230532576326699423435238922925367385304409606
Which is a gigantic number, and vanishingly unlikely to recur by chance for a different input.
Ok, so a hash function takes some data as its input and spits out a seemingly-random number. If I download the text of Moby Dick from Project Gutenberg, I can compute the CRC-32 hash of the entire book. It turns out to be 1206970038. If I change one single random letter of the text, the hash is different (It became 3207150610, for the random single-letter change I made). This suggests that hashes are useful in error detection. If you download some huge file for which message integrity is extremely important, one way to be sure you got the file without any errors is for the file provider to compute and post the hash along with the download link. When you're finished downloading, you compute the hash of your copy and check to make sure it's the same as the one posted by the file provider.
You can do quite a bit more than just error-detection with hashes. One example is password verification. If I run a bank and you want to do online banking, it might be safer for me as the bank not to keep your password on file. That way it can't be stolen. But then how can I verify you have the right password when you type it in? Simple. When you set up your password, I compute its hash and keep that on file. Every subsequent time you long it, I take the password you give me, hash it, and check to see if it matches the original hash. That way I can check to see if you know your password without having to keep a permanent record of your password on my server.
You can also use hashes to assign ID numbers. You can use them to generate pseudo-random numbers. You can use them to efficiently address data in a computer memory system. There are many other applications. Most importantly, you can use them in cryptography. But cryptography means you might have intelligent, well-equipped adversaries trying to break your codes. For these applications, which will be tomorrow's subject, we can't just say "well, the output of our hash looks pretty random" and call it a day. Bad hash functions in cryptographic applications can compromise everything from e-commerce to national security, and this is why NIST has spent so much time and effort testing and publicly reviewing potential candidates for the next generation of cryptographic hash functions.
If you want to play around with some hashes yourself, there's many online calculators such as this one that you can use. Give it a try!
[Note: hash outputs are almost always actually written in hexadecimal notation. The CRC-32 hash of my name would usually be written as a0635e9f, which is both more compact than the decimal form and easier for a computer to work with. I've also picked CRC-32 as my example because it's common and easy to work with. It's really an error-correction code rather than strictly a hash, but for non-cryptographic purposes it doesn't really matter.]