# Hash Week! (Part 1)

Last week NIST anounced the winner of its Cryptographic Hash Function Competition. After five years of review and many rounds of discussion and elimination, the winner is a hash function called Keccak, and its developers deserve many congratulations. It's a shame hash functions aren't better known in the general public, because not only are they a vital part of keeping data safe online, they're one of the most interesting bits of applied math. Better still, their basic concept is not complicated at all.

Hash functions are so cool, in fact, that I want to spend several posts this week discussing them. Today we'll just talk about what they are, and why they might be useful.

A hash function is just a mathematical function like any other. You put in a number, and the function spits out another number. In high school you dealt with functions like \$latex f(x) = x^2\$. Put in 2, it spits out 4. Put in 3, it spits out 9. A hash function is less predictable. It's designed to be less predictable. In fact, a working definition of "hash function" is a function whose output is a number with no obvious relation to the input.

Computers internally represent text as numbers, so the first letter of my name - "M" - could be represented by its numeric ASCII code - 77 - and fed into a function. Here we won't worry about the details of the text-to-number conversion. We'll just say that it's easy for a computer do so we'll just write the input of the function as text (It could also be an image, movie, or any other kind of digital data.). One hash function is called CRC-32, and if we denote it c(x) some examples of its output are:

c("Matt Springer") = 2690866847

c("ScienceBlogs") = 385760650

c("Science Blogs") = 531647807

Even very similar inputs can result in very different outputs, as expected. It's also important to notice that CRC-32 is a 32 bit hash, which just means every output is a number between 0 and 2^32 (which is 4,294,967,296). Of course there's more than four billion possible pieces of data in the world, and therefore some identical inputs which will correspond to the same outputs. While two random inputs will only have a 1 in 2^32 chance of producing identical hashes, the birthday paradox means that if you have a bunch of random inputs, odds are you only need on average 2^(32/2) = 65,536 inputs before you end up with a duplicate hash. That's not so many. There are even single words in the dictionary that have the same CRC-32 hash. Both "plumless" and "buckeroo" hash to 1306201125, for instance.

If you really really need to avoid duplicates, there no alternative but to use a hash function with a longer output. 128, 256, and 512 bits are common sizes. With a 256 bit hash, there's 2^256 possible outputs (more than 10^77) and odds are you'd need about 2^(256/2) random messages (about 10^38) before you end up with a duplicate hash. Thee are astronomical numbers, and unless something is mathematically wrong with the particular hash function in question we will never live to see see two different inputs produce the same 256-bit hash. One 256 bit hash is called SHA-512, and my name hashed using that function is

SHA-256("Matt Springer") = 20005913487026535327234686971684230532576326699423435238922925367385304409606

Which is a gigantic number, and vanishingly unlikely to recur by chance for a different input.

You can also use hashes to assign ID numbers. You can use them to generate pseudo-random numbers. You can use them to efficiently address data in a computer memory system. There are many other applications. Most importantly, you can use them in cryptography. But cryptography means you might have intelligent, well-equipped adversaries trying to break your codes. For these applications, which will be tomorrow's subject, we can't just say "well, the output of our hash looks pretty random" and call it a day. Bad hash functions in cryptographic applications can compromise everything from e-commerce to national security, and this is why NIST has spent so much time and effort testing and publicly reviewing potential candidates for the next generation of cryptographic hash functions.

If you want to play around with some hashes yourself, there's many online calculators such as this one that you can use. Give it a try!

[Note: hash outputs are almost always actually written in hexadecimal notation. The CRC-32 hash of my name would usually be written as a0635e9f, which is both more compact than the decimal form and easier for a computer to work with. I've also picked CRC-32 as my example because it's common and easy to work with. It's really an error-correction code rather than strictly a hash, but for non-cryptographic purposes it doesn't really matter.]

Tags

### More like this

##### Hash Week! (Part 2)
Yesterday we looked at hash functions. As you recall, they're functions which take an input and generate a random-seeming output. As a quick example, here's the output of the SHA-256 hash function for the name of the Scottish physicist James Maxwell and a misspelling thereof: SHA256("James Clerk…
##### Hash Week! (Part 3)
Over the last two days we've talked about hash functions and their uses in cryptography and elsewhere. Remember that an ideal hash function is basically what cryptographers call a random oracle - given an input, it produces a random number in some range. (In practice this range is always [0,2^(2^n…
##### Sunday Function
You're a member of the French Resistance in the height of WWII. You're part of a network of resistance members who have to work with other resistance members they've never met before. For instance, an agent from Paris might have to meet up with an agent in Normandy to work together on sabotage…
##### How Not to Do Message Integrity, featuring CBC-MAC
In my last cryptography post, I wrote about using message authentication codes (MACs) as a way of guaranteeing message integrity. To review briefly, most ciphers are designed to provide message confidentiality - which means that no one but the sender and the intended receiver can see the plain-…