One concept I've always wondered about is the use of cryptographic hash functions and values. I understand that these functions can generate a hash value that is unique and virtually impossible to reverse, but here's what I've always wondered:
If on my server, in PHP I produce:
md5("stackoverflow.com") = "d0cc85b26f2ceb8714b978e07def4f6e"
When you run that same string through an MD5 function, you get the same result on your PHP installation. A process is being used to produce some value, from some starting value.
Doesn't this mean that there is some way to deconstruct what is happening and reverse the hash value?
What is it about these functions that makes the resulting strings impossible to retrace?
Hash functions are not reversible in general. MD5 is a 128-bit hash, and so it maps any string, no matter how long, into 128 bits. Obviously if you run all strings of length, say, 129 bits, some of them have to hash to the same value.
It would be impossible to figure out the original data of the function with just the resulting hash – as not much of that data is left – the only workable method is to brute force every possible combination. If we could reverse a hash, we would be able to compress data of any size into a mere few bytes of data.
A hash function, by definition, cannot ever be reversed. If you can, it's not a hash. It is encoding or encryption.
The MD5 hash will change if there's any change to whatever data you input to the MD5 hashing function. If you fed it the permissions and the permission change, then the MD5 hash will change. If you fed it only the contents, then the MD5 hash will change only if the contents change.
The input material can be an infinite length, where the output is always 128 bits long. This means that an infinite number of input strings will generate the same output.
If you pick a random number and divide it by 2 but only write down the remainder, you'll get either a 0 or 1 -- even or odd, respectively. Is it possible to take that 0 or 1 and get the original number?
If hash functions such as MD5 were reversible then it would have been a watershed event in the history of data compression algorithms! Its easy to see that if MD5 were reversible then arbitrary chunks of data of arbitrary size could be represented by a mere 128 bits without any loss of information. Thus you would have been able to reconstruct the original message from a 128 bit number regardless of the size of the original message.
Contrary to what the most upvoted answers here emphasize, the non-injectivity (i.e. that there are several strings hashing to the same value) of a cryptographic hash function caused by the difference between large (potentially infinite) input size and fixed output size is not the important point – actually, we prefer hash functions where those collisions happen as seldom as possible.
Consider this function (in PHP notation, as the question):
function simple_hash($input) {
return bin2hex(substr(str_pad($input, 16), 0, 16));
}
This appends some spaces, if the string is too short, and then takes the first 16 bytes of the string, then encodes it as hexadecimal. It has the same output size as an MD5 hash (32 hexadecimal characters, or 16 bytes if we omit the bin2hex part).
print simple_hash("stackoverflow.com");
This will output:
737461636b6f766572666c6f772e636f6d
This function also has the same non-injectivity property as highlighted by Cody's answer for MD5: We can pass in strings of any size (as long as they fit into our computer), and it will output only 32 hex-digits. Of course it can't be injective.
But in this case, it is trivial to find a string which maps to the same hash (just apply hex2bin
on your hash, and you have it). If your original string had the length 16 (as our example), you even will get this original string. Nothing of this kind should be possible for MD5, even if you know the length of the input was quite short (other than by trying all possible inputs until we find one that matches, e.g. a brute-force attack).
The important assumptions for a cryptographic hash function are:
Obviously my simple_hash
function fulfills neither of these conditions. (Actually, if we restrict the input space to "16-byte strings", then my function becomes injective, and thus is even provable second-preimage resistant and collision resistant.)
There now exist collision attacks against MD5 (e.g. it is possible to produce a pair of strings, even with a given same prefix, which have the same hash, with quite some work, but not impossible much work), so you shouldn't use MD5 for anything critical. There is not yet a preimage attack, but attacks will get better.
To answer the actual question:
What is it about these functions that makes the resulting strings impossible to retrace?
What MD5 (and other hash functions build on the Merkle-Damgard construction) effectively do is applying an encryption algorithm with the message as the key and some fixed value as the "plain text", using the resulting ciphertext as the hash. (Before that, the input is padded and split in blocks, each of this blocks is used to encrypt the output of the previous block, XORed with its input to prevent reverse calculations.)
Modern encryption algorithms (including the ones used in hash functions) are made in a way to make it hard to recover the key, even given both plaintext and ciphertext (or even when the adversary chooses one of them). They do this generally by doing lots of bit-shuffling operations in a way that each output bit is determined by each key bit (several times) and also each input bit. That way you can only easily retrace what happens inside if you know the full key and either input or output.
For MD5-like hash functions and a preimage attack (with a single-block hashed string, to make things easier), you only have input and output of your encryption function, but not the key (this is what you are looking for).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With