Hash Week!
https://www.scienceblogs.com/
Hash Week! (Part 3)
https://www.scienceblogs.com/builtonfacts/2012/10/10/hash-week-part-3
<span>Hash Week! (Part 3)</span>
<div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>Over the last two days we've talked about hash functions and their uses in cryptography and elsewhere. Remember that an ideal hash function is basically what cryptographers call a <em>random oracle</em> - given an input, it produces a random number in some range. (In practice this range is usually 0 to 2^(2^n) - 1, with n typically equal to 5 or 6 for non-cryptographic hashes and 7, 8, or 9 for cryptographic hashes.) This random number is deterministic, in that the same input always produces the same output. But the output is otherwise unpredictable. Given an output, it should not be possible to find a corresponding input except by brute-force trial of possible inputs. (The other conditions we discussed yesterday apply too.)</p>
<p>So how does a hash function actually work? One possibility might be just to add up the numbers in the input, divide by some chosen constant, and output the remainder as the hash value. In ASCII code, the letters of my first name, added together, give</p>
<p>"Matthew" = 77 + 97 + 116 + 116 + 104 + 101 + 119 = 730</p>
<p>If I want the hash to output a value between 0 and 31, I can divide by 32 and take the remainder as my hash. 32 goes into 730 a total of 22 times, with 26 left over. So the hash of my name with this very simple system is 26.</p>
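<p>Here is a minimal Python sketch of that toy scheme, just to make the arithmetic concrete. The function name <code>additive_hash</code> is mine, not a standard one, and the second example shows why nobody should use this for anything serious: rearranging the letters never changes the sum.</p>

```python
def additive_hash(text: str, modulus: int = 32) -> int:
    """Toy hash: add up the ASCII codes and keep the remainder mod `modulus`."""
    total = sum(text.encode("ascii"))
    return total % modulus


print(sum("Matthew".encode("ascii")))  # 730
print(additive_hash("Matthew"))        # 26
print(additive_hash("wehttaM"))        # 26 again: same letters, same sum
```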
<p>Some of the non-cryptographic hashes like CRC-32 are actually almost this simple. They're very useful for error-checking and the like, but they fail the "random oracle" criterion completely. It's easy to engineer a string with any given hash output.</p>
<p>The winner of NIST's competition to develop the new SHA-3 hash standard is an algorithm named <a href="https://en.wikipedia.org/wiki/Keccak">Keccak</a>. Internally it's pretty complicated, but conceptually it's simple. In cryptographic terms it follows the sponge construction, which describes the way that it "soaks up" the input and "squeezes out" the output. It works like this: Keccak breaks up the input into blocks of about a hundred bytes. It takes the first block and dumps it into the hash's internal memory. It scrambles the internal memory around with what amounts to a complicated shuffle. Then it takes the next block and combines it with the internal memory (via an XOR operation) and shuffles the internal memory again. Then it takes the next block, combines it with the internal memory, and shuffles again. This repeats until the hash has processed the entire input. When that happens, the output of the hash is simply read out of the internal state at the end (only the first part of the state is used, as much as the output length requires).</p>
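<p>To make the absorb-and-shuffle loop concrete, here is a toy sponge in Python. To be loud about the assumptions: <code>toy_shuffle</code>, the 8-byte block size, the 25-byte state, and the padding rule are all invented for illustration. This is not Keccak and it is not secure. It only mirrors the shape of the process described above. The last line calls the real SHA-3 from Python's standard <code>hashlib</code> for comparison.</p>

```python
import hashlib

RATE = 8     # bytes absorbed per block (toy value; SHA3-256 absorbs 136)
WIDTH = 25   # bytes of internal state (toy value; Keccak's state is 200)


def toy_shuffle(state: bytearray) -> bytearray:
    """Hypothetical stand-in for Keccak's permutation. Not the real one, not secure."""
    s = bytearray(state)
    for rnd in range(4):
        for i in range(WIDTH):
            # Mix each byte with two others plus a position-dependent constant.
            s[i] = (s[i] + 3 * s[(i + 1) % WIDTH] + s[(i + 7) % WIDTH] + rnd + i) % 256
        s = s[5:] + s[:5]  # rotate the byte positions
    return s


def toy_sponge(message: bytes, out_len: int = 8) -> bytes:
    # Toy padding: append 0x01, then zeros up to a multiple of RATE.
    padded = message + b"\x01"
    padded += b"\x00" * (-len(padded) % RATE)
    state = bytearray(WIDTH)                    # internal memory starts as all zeros
    for start in range(0, len(padded), RATE):   # "soak up" the input block by block
        block = padded[start:start + RATE]
        for i, byte in enumerate(block):
            state[i] ^= byte                    # combine the block with the state via XOR
        state = toy_shuffle(state)              # scramble the internal memory
    return bytes(state[:out_len])               # "squeeze": read out part of the state


print(toy_sponge(b"Matthew").hex())
print(hashlib.sha3_256(b"Matthew").hexdigest())  # the real thing, for comparison
```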
<p>That glosses over a lot of subtleties and completely ignores the details of the shuffle. I'm not going to try to explain those subtleties or those details because I don't understand them very well - I'm not at all a cryptographer, just an interested amateur. Still, those details are interesting and I encourage you to take a look at the Wikipedia article.</p>
</div>
<span><a title="View user profile." href="https://www.scienceblogs.com/author/mspringer" lang="" about="https://www.scienceblogs.com/author/mspringer" typeof="schema:Person" property="schema:name" datatype="" xml:lang="">mspringer</a></span>
<span>Wed, 10/10/2012 - 13:07</span>
<div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline">
<div class="field--label">Tags</div>
<div class="field--items">
<div class="field--item"><a href="https://www.scienceblogs.com/tag/hash-week" hreflang="en">Hash Week!</a></div>
</div>
</div>
Hash Week! (Part 2)
https://www.scienceblogs.com/builtonfacts/2012/10/09/hash-week-part-2
<span>Hash Week! (Part 2)</span>
<div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>Yesterday we looked at hash functions. As you recall, they're functions which take an input and generate a random-seeming output. As a quick example, here's the output of the SHA-256 hash function for the name of the Scottish physicist James Maxwell and a misspelling thereof:</p>
<p>SHA256("James Clerk Maxwell") = 2667629603913530690117759428994407894024237387971995154086108064226397\<br />
5353322</p>
<p>SHA256("James Clark Maxwell") = 9129664885155451589341762461551711693832872424126676652783015499131718\<br />
4589063</p>
<p>A tiny change in the input generates a wildly different output, so it looks like SHA256 is a pretty good hash function. For every input, it dumps out some 256-bit number that looks entirely random. For cryptographic purposes it's not enough that the digits look random. The function needs to satisfy three specific properties, which we'll go through one at a time.</p>
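<p>If you'd like to see this for yourself, here is a small Python sketch using the standard <code>hashlib</code> module. The helper names are mine. It prints each hash as a decimal integer and counts how many of the 256 output bits differ between the two spellings. For a good hash you'd expect that count to land somewhere around half of them.</p>

```python
import hashlib


def sha256_int(text: str) -> int:
    """SHA-256 of the UTF-8 text, read as one big 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Number of bit positions where the two SHA-256 outputs disagree."""
    return bin(sha256_int(a) ^ sha256_int(b)).count("1")


print(sha256_int("James Clerk Maxwell"))
print(sha256_int("James Clark Maxwell"))
print(differing_bits("James Clerk Maxwell", "James Clark Maxwell"), "of 256 bits differ")
```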
<p><strong>1. Preimage resistance.</strong></p>
<p>If I give you a <em>hash value</em>, you should not be able to find a message whose hash is that value. In other words, if I say SHA256(x) = 1402163220222678497648226475128810495847235325536749812516677580084870\<br />
9608774, you should not be able to come up with some x that works. Of course you could always just start hashing random strings and odds are after about 2^256 of them you'd hit a string that hashes to that value just by chance. But 2^256 is a gigantic number and in practice you'll never be able to do it.</p>
<p>Why care about preimage resistance? With digital signature algorithms, it is possible to make mathematical versions of statements like "The owner of this cryptographic key asserts that the message with hash [some value] did in fact originate with them." If you can generate a distinct message with the same hash, this authentication can be compromised.</p>
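<p>The brute-force attack is easy to write down. It's just hopeless against 256 bits. The sketch below makes it feasible by deliberately weakening the hash: it keeps only the top 16 bits of SHA-256, so about 2^16 guesses are needed on average. The truncation, the function names, and the choice of candidate strings are all assumptions for the sake of the demo.</p>

```python
import hashlib
from itertools import count


def truncated_sha256(message: bytes, bits: int) -> int:
    """Top `bits` bits of SHA-256: a deliberately weakened toy hash."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)


def brute_force_preimage(target: int, bits: int):
    """Try "1", "2", "3", ... until one of them hashes to `target`."""
    for tries in count(1):
        candidate = str(tries).encode("ascii")
        if truncated_sha256(candidate, bits) == target:
            return candidate, tries


target = truncated_sha256(b"Operation Overlord to commence at midnight", 16)
found, tries = brute_force_preimage(target, 16)
print(found, "hashes to the target after", tries, "tries")
```

<p>Every extra bit of output doubles the expected work, which is why the same loop is useless against the full 256 bits.</p>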
<p><strong>2. Second-preimage resistance.</strong></p>
<p>If I give you a <em>message</em>, and you compute its hash, you should not be able to generate a different message with the same hash. This is a slightly more difficult test for a hash algorithm to pass. Here the attacker effectively has two pieces of information - the original message, and its hash. If a hash algorithm is weak, the attacker might be able to tweak the original message in such a way that it still has the same hash value. You'd hate for an attacker to be able to take "Operation Overlord to commence at midnight" and generate a message like "Operation Overlord to be cancelled" and cause the replacement to have the same hash by judicious arrangements of wording or typos.</p>
<p>This might seem somewhat academic. If the attacker has access to the original message, aren't you already in deep trouble? Not always. Sometimes the message is meant to be public, and the sender is using a digital signature algorithm to sign the hash and thus verify the authenticity of the message. Your online banking website, for instance, has a public cryptographic key whose validity is checked by some certificate authority, and the certificate authority uses the hash of that public key to validate its authenticity to your browser. If it were possible to generate a fake key that hashed to the right value, this authentication would be compromised.</p>
<p><strong>3. Collision resistance.</strong></p>
<p>This is the hard one. You should not be able to find <em>any</em> two messages with the same hash. Here the attacker doesn't care what the messages are or what the hashes are. All that matters is finding two different messages that hash to the same value.</p>
<p>Unfortunately, finding a collision is much easier in general than finding a preimage. If I want to find someone who shares my birthday, April 8, odds are I'll have to ask about 365 people before I find a match. But if I just start asking people their birthdays and I'm willing to settle for any match between any two people, I only have to ask about 23 people before I have an even chance of finding a match. In general the number of samples I need to find a birthday match - or a hash collision - is proportional to the square root of the number of different possible birthdays - or hashes. So if I have a 128-bit hash, there are some 10^38 possible hashes, but I only have to hash some 2^64 (about 10^19) strings before I can expect to find a collision. And while that's a big number, it's not inconceivable that a collision could be generated by brute-force checking of 10^19 hashes. But it would be tough.</p>
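<p>The birthday numbers are easy to check directly. This short Python sketch (function names are mine) multiplies out the chance that everyone in a group has a different birthday, assuming 365 equally likely days, and finds the smallest group with at least even odds of a shared one. The exact calculation gives 23 people.</p>

```python
def p_shared_birthday(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday (uniform days assumed)."""
    p_all_distinct = 1.0
    for k in range(people):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def people_for_even_odds(days: int = 365) -> int:
    """Smallest group size whose chance of a shared birthday is at least 50%."""
    n = 1
    while p_shared_birthday(n, days) < 0.5:
        n += 1
    return n


print(people_for_even_odds())           # 23
print(round(p_shared_birthday(23), 3))  # 0.507
```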
<p>If I need more security than that, I can just use a bigger hash. If I use a 256-bit hash then I'd have to check some 2^128 (about 10^38) hashes before I'm likely to find a collision. And that's a ridiculously huge number even for computers. Or I could use a 512-bit hash and I'd have an inconceivably intractable 10^77 hashes to calculate before I'm likely to find a collision.</p>
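<p>You can watch the square-root rule in action by shrinking the hash. The sketch below, which again truncates SHA-256 (an assumption made purely so the demo finishes quickly), looks for <em>any</em> two inputs with the same 32-bit hash by remembering every hash it has seen. With 2^32 possible outputs, a collision typically turns up after something on the order of 2^16 inputs rather than billions.</p>

```python
import hashlib
from itertools import count


def truncated_sha256(message: bytes, bits: int) -> int:
    """Top `bits` bits of SHA-256: a deliberately weakened toy hash."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int):
    """Hash "1", "2", "3", ... and stop at the first repeated hash value."""
    seen = {}
    for tries in count(1):
        message = str(tries).encode("ascii")
        h = truncated_sha256(message, bits)
        if h in seen:
            return seen[h], message, tries
        seen[h] = message


first, second, tries = find_collision(32)
print(first, "and", second, "collide after", tries, "hashes")
```

<p>The price is memory: the dictionary has to hold everything hashed so far, which is part of why 2^64 is tough and 2^128 is out of reach.</p>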
<p>That is, if my hash function is collision-resistant. One of the most common cryptographic hash functions is called MD5, and cryptographers have figured out a way to easily generate collisions with that function. Even though it's a 128-bit hash, it only takes a home computer a few seconds to generate a collision with the right algorithms.</p>
<p>While the ability to generate a collision with possibly random-looking messages might not seem so bad, in practice a sufficiently clever attacker can use collisions to compromise security. For instance, the authors of the Flame malware that attacked Iran's computer systems were able to use an MD5 collision to generate a fake "This software came from Microsoft and is trustworthy" certificate.</p>
<p>With the widely used MD5 hash comprehensively broken, its replacement SHA-1 showing serious theoretical weaknesses, and the newer SHA-2 possibly vulnerable to similar attacks, NIST decided to put out a call for proposals for a new hash function whose design takes into account the great advances in cryptography over the last decade or two. Tomorrow we'll talk about the just-announced winner of NIST's competition.</p>
<p>[The SHA-2 hash comes in several variants with different bit sizes. The SHA-256 hash at the beginning of this post is in the SHA-2 family of hashes. This awkward naming convention has been the subject of a considerable dust-up in the SHA-3 mailing list, as interested parties debate various alternatives to long designations like SHA-3-256.]</p>
</div>
<span><a title="View user profile." href="https://www.scienceblogs.com/author/mspringer" lang="" about="https://www.scienceblogs.com/author/mspringer" typeof="schema:Person" property="schema:name" datatype="" xml:lang="">mspringer</a></span>
<span>Tue, 10/09/2012 - 09:20</span>
<div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline">
<div class="field--label">Tags</div>
<div class="field--items">
<div class="field--item"><a href="https://www.scienceblogs.com/tag/hash-week" hreflang="en">Hash Week!</a></div>
</div>
</div>
Hash Week! (Part 1)
https://www.scienceblogs.com/builtonfacts/2012/10/08/hash-week-part-1
<span>Hash Week! (Part 1)</span>
<div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"><p>Last week NIST announced the winner of its Cryptographic Hash Function Competition. After five years of review and many rounds of discussion and elimination, the winner is a hash function called <a href="http://keccak.noekeon.org/">Keccak</a>, and its developers deserve many congratulations. It's a shame hash functions aren't better known among the general public, because not only are they a vital part of keeping data safe online, they're one of the most interesting bits of applied math. Better still, their basic concept is not complicated at all.</p>
<p>Hash functions are so cool, in fact, that I want to spend several posts this week discussing them. Today we'll just talk about what they are, and why they might be useful.</p>
<p>A hash function is just a mathematical function like any other. You put in a number, and the function spits out another number. In high school you dealt with functions like f(x) = x^2. Put in 2, it spits out 4. Put in 3, it spits out 9. A hash function is less predictable. It's <em>designed</em> to be less predictable. In fact, a working definition of "hash function" is a function whose output is a number with no obvious relation to the input.</p>
<p>Computers internally represent text as numbers, so the first letter of my name - "M" - could be represented by its numeric ASCII code - 77 - and fed into a function. Here we won't worry about the details of the text-to-number conversion. We'll just say that it's easy for a computer to do, so we'll just write the input of the function as text. (It could also be an image, movie, or any other kind of digital data.) One hash function is called CRC-32, and if we denote it c(x) some examples of its output are:</p>
<p>c("Matt Springer") = 2690866847</p>
<p>c("ScienceBlogs") = 385760650</p>
<p>c("Science Blogs") = 531647807</p>
<p>Even very similar inputs can result in very different outputs, as expected. It's also important to notice that CRC-32 is a 32-bit hash, which just means every output is a number between 0 and 2^32 - 1 (so there are 2^32 = 4,294,967,296 possible outputs). Of course there are more than four billion possible pieces of data in the world, and therefore some different inputs must correspond to the same output. While two random inputs will only have a 1 in 2^32 chance of producing identical hashes, the <a href="https://en.wikipedia.org/wiki/Birthday_paradox">birthday paradox</a> means that if you have a bunch of random inputs, odds are you only need roughly 2^(32/2) = 65,536 inputs before you end up with a duplicate hash. That's not so many. There are even single words in the dictionary that have the same CRC-32 hash. Both "plumless" and "buckeroo" hash to 1306201125, for instance.</p>
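<p>Python's standard <code>zlib</code> module has a CRC-32 built in, so these are easy to try. I'm assuming here that the numbers in this post come from the same common CRC-32 variant that <code>zlib.crc32</code> implements, with the text encoded as plain ASCII/UTF-8 bytes. The helper name <code>crc32_of</code> is mine.</p>

```python
import zlib


def crc32_of(text: str) -> int:
    """CRC-32 of the UTF-8 bytes of `text`, as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8"))


for s in ("Matt Springer", "ScienceBlogs", "Science Blogs"):
    print(s, "->", crc32_of(s))

# Two different dictionary words, one hash value.
print(crc32_of("plumless"), crc32_of("buckeroo"))
print(crc32_of("plumless") == crc32_of("buckeroo"))
```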
<p>If you really really need to avoid duplicates, there's no alternative but to use a hash function with a longer output. 128, 256, and 512 bits are common sizes. With a 256-bit hash, there are 2^256 possible outputs (more than 10^77) and odds are you'd need about 2^(256/2) random messages (about 10^38) before you end up with a duplicate hash. These are astronomical numbers, and unless something is mathematically wrong with the particular hash function in question we will never live to see two different inputs produce the same 256-bit hash. One 256-bit hash is called SHA-256, and my name hashed using that function is</p>
<p>SHA-256("Matt Springer") = 20005913487026535327234686971684230532576326699423435238922925367385304409606</p>
<p>Which is a gigantic number, and vanishingly unlikely to recur by chance for a different input.</p>
<p>Ok, so a hash function takes some data as its input and spits out a seemingly-random number. If I download the text of <em>Moby Dick</em> from Project Gutenberg, I can compute the CRC-32 hash of the entire book. It turns out to be 1206970038. If I change one single random letter of the text, the hash is different (it became 3207150610 for the random single-letter change I made). This suggests that hashes are useful in error detection. If you download some huge file for which message integrity is extremely important, one way to be sure you got the file without any errors is for the file provider to compute and post the hash along with the download link. When you're finished downloading, you compute the hash of your copy and check to make sure it's the same as the one posted by the file provider.</p>
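<p>In code, checking a download looks something like the sketch below. It uses SHA-256 from Python's <code>hashlib</code> rather than CRC-32, since that's what file providers more commonly post, and it reads the file in chunks so a huge file never has to fit in memory. The function names and the chunk size are my own choices.</p>

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 of a file, read one chunk at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path: str, published_hex: str) -> bool:
    """Compare our copy's hash with the one posted by the file provider."""
    return file_sha256(path) == published_hex.strip().lower()
```

<p>You would call <code>download_is_intact</code> with the path of your downloaded copy and the hash string copied from the provider's page. A single flipped bit anywhere in the file makes it return <code>False</code>.</p>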
<p>You can do quite a bit more than just error-detection with hashes. One example is password verification. If I run a bank and you want to do online banking, it might be safer for me as the bank not to keep your password on file. That way it can't be stolen. But then how can I verify you have the right password when you type it in? Simple. When you set up your password, I compute its hash and keep that on file. Every subsequent time you log in, I take the password you give me, hash it, and check to see if it matches the original hash. That way I can check to see if you know your password without having to keep a permanent record of your password on my server.</p>
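<p>Here's a rough sketch of that idea in Python. One caveat up front: instead of a single bare hash, it uses <code>hashlib.pbkdf2_hmac</code> with a random per-user salt, which is the usual way real systems do this (the salt keeps identical passwords from producing identical records, and the repeated hashing slows down guessing). That's a swapped-in refinement, not something described above, and the iteration count and function names are illustrative assumptions.</p>

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not a recommendation


def make_record(password: str):
    """What the bank stores at sign-up: a random salt and the derived hash."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(attempt: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the login attempt with the same salt and compare the results."""
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)  # constant-time comparison


salt, stored = make_record("correct horse battery staple")
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("correct horse battery stapler", salt, stored))  # False
```

<p>The password itself never gets written to disk. Only the salt and the derived hash do.</p>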
<p>You can also use hashes to assign ID numbers. You can use them to generate pseudo-random numbers. You can use them to efficiently address data in a computer memory system. There are many other applications. Most importantly, you can use them in cryptography. But cryptography means you might have intelligent, well-equipped adversaries trying to break your codes. For these applications, which will be tomorrow's subject, we can't just say "well, the output of our hash looks pretty random" and call it a day. Bad hash functions in cryptographic applications can compromise everything from e-commerce to national security, and this is why NIST has spent so much time and effort testing and publicly reviewing potential candidates for the next generation of cryptographic hash functions.</p>
<p>If you want to play around with some hashes yourself, there are many online calculators <a href="http://www.fileformat.info/tool/hash.htm?text=Matt+Springer">such as this one</a> that you can use. Give it a try!</p>
<p>[Note: hash outputs are almost always actually written in hexadecimal notation. The CRC-32 hash of my name would usually be written as a0635e9f, which is both more compact than the decimal form and easier for a computer to work with. I've also picked CRC-32 as my example because it's common and easy to work with. It's really an error-detection code rather than strictly a hash, but for non-cryptographic purposes it doesn't really matter.]</p>
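<p>Converting between the two notations is a one-liner in most languages. A quick Python sketch (the helper name <code>to_hex</code> is mine), using the two CRC-32 values quoted in this post:</p>

```python
def to_hex(value: int, bits: int = 32) -> str:
    """Write a hash value as zero-padded lowercase hexadecimal."""
    return format(value, "0{}x".format(bits // 4))


print(to_hex(2690866847))   # a0635e9f
print(to_hex(1306201125))   # 4ddb0c25
print(int("a0635e9f", 16))  # 2690866847
```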
</div>
<span><a title="View user profile." href="https://www.scienceblogs.com/author/mspringer" lang="" about="https://www.scienceblogs.com/author/mspringer" typeof="schema:Person" property="schema:name" datatype="" xml:lang="">mspringer</a></span>
<span>Mon, 10/08/2012 - 09:11</span>
<div class="field field--name-field-blog-tags field--type-entity-reference field--label-inline">
<div class="field--label">Tags</div>
<div class="field--items">
<div class="field--item"><a href="https://www.scienceblogs.com/tag/hash-week" hreflang="en">Hash Week!</a></div>
</div>
</div>