How come MD5 hash values are not reversible?

Tags:

cryptography

hash

md5

cryptographic-hash-function

One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:

If on my server, in PHP I produce:

md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"

When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.

Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?

What is it about these functions that makes the resulting strings impossible to retrace?

860

asked Dec 01 '08 07:12

barfoon

3 Answers

The input can be of arbitrary length, whereas the output is always 128 bits long. This means that infinitely many input strings produce the same output.

If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?
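A minimal sketch of that remainder analogy, in PHP to match the question. The name parity_hash is made up for illustration. Many different inputs land on the same one-bit output, so the output alone cannot tell you which input was used.

```php
<?php
// Hypothetical toy "hash": keep only the remainder after dividing by 2.
function parity_hash(int $n): int {
    return $n % 2;
}

// Several different non-negative inputs collapse onto the same output.
foreach ([4, 10, 128, 7, 33] as $n) {
    echo $n, " -> ", parity_hash($n), "\n";
}
```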

142

answered Oct 04 '22 14:10

Serafina Brocious

If hash functions such as MD5 were reversible, it would be a watershed event in the history of data compression! It's easy to see that if MD5 were reversible, then chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. You would be able to reconstruct the original message from a 128-bit number regardless of the size of the original message.
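The counting argument can be made concrete on a smaller scale. The sketch below assumes a made-up 8-bit "hash" (tiny_hash, the first byte of MD5 written as two hex digits) and feeds it 257 distinct inputs. With only 256 possible outputs, at least two inputs must share one, so no decoder could recover both.

```php
<?php
// Hypothetical helper: an 8-bit "hash", the first byte of MD5 as 2 hex digits.
function tiny_hash(string $input): string {
    return substr(md5($input), 0, 2);
}

// 257 distinct inputs but only 256 possible outputs:
// the pigeonhole principle guarantees a collision.
function find_collision(): ?array {
    $seen = [];
    for ($i = 0; $i <= 256; $i++) {
        $input = "message-$i";
        $h = tiny_hash($input);
        if (isset($seen[$h])) {
            return [$seen[$h], $input, $h]; // two inputs, one shared output
        }
        $seen[$h] = $input;
    }
    return null; // not reachable: 257 inputs cannot fit in 256 slots
}

[$first, $second, $shared] = find_collision();
echo "$first and $second both map to $shared\n";
```

The same reasoning applies to the full 128-bit output, only with far more inputs than anyone could enumerate.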

answered Oct 04 '22 15:10

Autodidact

Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.

Consider this function (in PHP notation, as the question):

function simple_hash($input) {
     return bin2hex(substr(str_pad($input, 16), 0, 16));
}

This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).

print simple_hash("stackoverflow.com");

This will output:

737461636b6f766572666c6f772e636f

This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.

But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin to your hash, and you have it). If your original string had length 16, you even get the original string back (our 17-character example loses only its final character). Nothing of this kind should be possible for MD5, even if you know the input was quite short (other than by trying all possible inputs until one matches, i.e. a brute-force attack).
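A short sketch of that inversion, reusing simple_hash from above. The variable names and the short input "md5" are chosen only for illustration.

```php
<?php
function simple_hash($input) {
    return bin2hex(substr(str_pad($input, 16), 0, 16));
}

$hash = simple_hash("stackoverflow.com");
$preimage = hex2bin($hash);                 // "stackoverflow.co" (first 16 bytes)
var_dump(simple_hash($preimage) === $hash); // bool(true)

// For short inputs, stripping the padding recovers the original exactly
// (assuming the original did not itself end in spaces).
echo rtrim(hex2bin(simple_hash("md5"))), "\n"; // md5
```

No comparable one-line inverse is known for md5().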

The important assumptions for a cryptographic hash function are:

it is hard to find any string producing a given hash (preimage resistance)
it is hard to find any different string producing the same hash as a given string (second preimage resistance)
it is hard to find any pair of strings with the same hash (collision resistance)

Obviously my simple_hash function fulfills none of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provably second-preimage resistant and collision resistant.)
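To illustrate how easily simple_hash fails on unrestricted input, here is a small sketch. The second string is a made-up input that merely shares the first 16 bytes with the first one.

```php
<?php
function simple_hash($input) {
    return bin2hex(substr(str_pad($input, 16), 0, 16));
}

$a = "stackoverflow.com";
$b = "stackoverflow.community"; // hypothetical second input, same first 16 bytes

// Anything past byte 16 is ignored, so a second preimage is free.
var_dump(simple_hash($a) === simple_hash($b)); // bool(true)

// MD5 mixes every input byte into the result, so the digests differ.
var_dump(md5($a) === md5($b));                 // bool(false)
```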

There now exist collision attacks against MD5 (i.e. it is possible to produce a pair of strings with the same hash, even with a given common prefix, with considerable but entirely feasible effort), so you shouldn't use MD5 for anything critical. There is not yet a practical preimage attack, but attacks will get better.

To answer the actual question:

What is it about these functions that makes the resulting strings impossible to retrace?

What MD5 (and other hash functions built on the Merkle-Damgård construction) effectively do is apply an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split into blocks; each of these blocks is used as the key to encrypt the output of the previous block, and the result is XORed with that previous output to prevent reverse calculations.)
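The structure described above can be sketched with a toy. Everything here is an assumption for illustration: toy_encrypt is a made-up, insecure 32-bit cipher, toy_md_hash is a made-up name, the padding is simplified, and the code assumes a 64-bit PHP build. It is not MD5, only the same chaining pattern: each message block acts as the key, the running state is the plaintext, and the old state is XORed back in.

```php
<?php
// Toy 32-bit block cipher (NOT secure): four rounds of add, rotate, xor.
// Each round can be undone by anyone who knows the key.
function toy_encrypt(int $key, int $block): int {
    $x = $block & 0xFFFFFFFF;
    for ($round = 0; $round < 4; $round++) {
        $x = ($x + $key + $round) & 0xFFFFFFFF;     // add key material
        $x = (($x << 7) | ($x >> 25)) & 0xFFFFFFFF; // rotate left by 7 bits
        $x ^= $key;                                 // xor key material
    }
    return $x;
}

// Toy Merkle-Damgard style hash with a 32-bit state and 4-byte blocks.
function toy_md_hash(string $message): int {
    // Simplified padding: a 0x80 byte, zero bytes up to a block boundary,
    // then one extra block holding the message length.
    $padded = $message . "\x80";
    while (strlen($padded) % 4 !== 0) {
        $padded .= "\x00";
    }
    $padded .= pack('N', strlen($message));

    $state = 0x01234567; // arbitrary fixed starting value
    foreach (str_split($padded, 4) as $chunk) {
        $key = unpack('N', $chunk)[1];               // message block used as the key
        $state = toy_encrypt($key, $state) ^ $state; // feed-forward XOR
    }
    return $state;
}

printf("%08x\n", toy_md_hash("stackoverflow.com"));
```

Without the final XOR, anyone who guessed a block could run the cipher backwards and step to the previous state. The feed-forward removes that shortcut.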

Modern encryption algorithms (including the ones used in hash functions) are designed to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them). They generally do this by performing lots of bit-shuffling operations in such a way that each output bit depends on every key bit (several times over) and also on every input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.

For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you have only the input and output of your encryption function, but not the key (which is exactly what you are looking for).
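Lacking a structural shortcut, the generic fallback is to guess messages until one matches. The sketch below assumes a made-up helper, brute_force_preimage, and a deliberately tiny search space (lowercase strings of at most three letters). It only succeeds because the space is small, and the work grows exponentially with input length.

```php
<?php
// Hypothetical helper: try every lowercase string up to $maxLength
// and return the first one whose MD5 matches the target.
function brute_force_preimage(string $targetHash, int $maxLength): ?string {
    $alphabet = range('a', 'z');
    $candidates = [''];
    for ($len = 1; $len <= $maxLength; $len++) {
        $next = [];
        foreach ($candidates as $prefix) {
            foreach ($alphabet as $ch) {
                $candidate = $prefix . $ch;
                if (md5($candidate) === $targetHash) {
                    return $candidate;
                }
                $next[] = $candidate; // keep as a prefix for the next length
            }
        }
        $candidates = $next;
    }
    return null; // nothing in the search space matched
}

// At most 26 + 26^2 + 26^3 = 18278 guesses for this search space.
echo brute_force_preimage(md5("cat"), 3), "\n"; // cat
```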

answered Oct 04 '22 13:10

Paŭlo Ebermann
