I want to calculate the SHA-1 hash of a set (an unordered list) of elements. I have already calculated the SHA-1 hash of each element. I'm considering two solutions:
1. Sort the elements by their hashes and calculate a top-level hash of the sorted list.
2. Treat the element hashes as 160-bit integer values and XOR (bitwise operation) them together into one 160-bit hash.
Is the second solution weaker in terms of secure hash function properties (pre-image resistance, second pre-image resistance, collision resistance)?
Option 1 is what is done in ERS: that standard uses hash trees, where each node contains a hash value computed over the set of hash values from the child nodes; since order is not significant in the tree, the values are sorted lexicographically before hashing. This is good, and, as far as we know, safe.
Option 2 is very unsafe: if the hash function has 160-bit output, then I can easily generate 160 random inputs such that the corresponding hash values constitute a basis of the vector space GF(2)^160, at which point I can produce a matching set for any aggregate hash value. Attack cost is negligible.
Option 3, suggested by @paj28 (sorting the values to hash, then hashing them), is fine, too, as long as you "concatenate" the sorted values with an unambiguous separator. For instance, if you hash the set of strings containing "bar" and "foo", you don't want to obtain the same hash value as with the set of strings containing "ba" and "rfoo". It is easier to get something safe when all values to hash have the same length.
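To make the "bar"/"foo" versus "ba"/"rfoo" ambiguity concrete, here is a minimal Python sketch. The function names and the 8-byte length prefix are assumptions chosen for illustration; a length prefix is just one possible unambiguous encoding, not something prescribed above.

```python
import hashlib

def naive_set_hash(items):
    # Sort, then concatenate with no separator at all: ambiguous.
    return hashlib.sha1(b"".join(sorted(items))).hexdigest()

def length_prefixed_set_hash(items):
    # Sort, then feed each value preceded by its length as a fixed-width
    # 8-byte big-endian integer, so value boundaries cannot shift.
    h = hashlib.sha1()
    for item in sorted(items):
        h.update(len(item).to_bytes(8, "big"))
        h.update(item)
    return h.hexdigest()

set_a = {b"bar", b"foo"}
set_b = {b"ba", b"rfoo"}

# Both sets serialize to b"barfoo" without a separator, so they collide.
print(naive_set_hash(set_a) == naive_set_hash(set_b))
# With length prefixes the two serializations differ.
print(length_prefixed_set_hash(set_a) == length_prefixed_set_hash(set_b))
```

The first comparison holds because both sorted concatenations are the same byte string; the second encoding feeds different byte strings to SHA-1 for the two sets.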
Therefore, use option 1: hash each value in the set, then sort the hash values in lexicographic order, and hash the sorted list of values again.
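Option 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the ERS encoding itself: the function name `set_hash` is hypothetical, and it assumes the elements are byte strings and that the set contains no duplicates.

```python
import hashlib

def set_hash(elements):
    # Step 1: hash each element; every digest is exactly 20 bytes.
    digests = [hashlib.sha1(e).digest() for e in elements]
    # Step 2: sort the digests lexicographically (bytes compare that way).
    digests.sort()
    # Step 3: hash the concatenation. Since all digests have the same
    # length, plain concatenation is unambiguous here.
    return hashlib.sha1(b"".join(digests)).hexdigest()

# The result does not depend on the order in which elements are supplied.
print(set_hash([b"apple", b"banana", b"cherry"]) ==
      set_hash([b"cherry", b"apple", b"banana"]))
```

Because every per-element digest has a fixed length, no separator is needed between the sorted values.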
On the attack with option 2: this is linear algebra. Suppose that you have k vectors of n bits, such that none of them is equal to the XOR of some of the k-1 other vectors (they are said to be linearly independent). Then consider a new random vector v; the probability that this vector is equal to the XOR of some of the k vectors is equal to 2^(k-n), i.e. it is at most 1/2 as long as k < n, and negligible while k is much smaller than n. If the new vector v is indeed linearly independent of the k vectors you already have (which happens with probability 1-2^(k-n)), then add it to the set: you now have k+1 linearly independent vectors.
Recurse: you will soon obtain n vectors of n bits which are linearly independent of each other. But you cannot go further, because the probability that any new vector is linearly independent of the n previous ones has dropped to 0. The n vectors are said to be a basis for the vector space.
In this case, the vectors are obtained by simply hashing values (random values, or values with structure, it does not matter much, because the hash function acts as a randomizer).
For a given set of k vectors, determining whether a new vector v is linearly independent of the k vectors is easy with Gaussian elimination. The same algorithm lets you know, once you have a basis, which of your n basis vectors shall be XORed together to yield any vector v'. In the setup of this question, this means that once I have produced n values m_i such that the h(m_i) constitute a basis, then for any target n-bit output t, I can use Gaussian elimination to work out which of my h(m_i) may be XORed together to yield exactly the value t. The corresponding m_i values are then a preimage set for t.
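The attack described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a tuned implementation: hash values are held as Python integers, the candidate messages are simple counter strings ("msg-0", "msg-1", ...), and all function names are hypothetical. The elimination is done incrementally: each new hash is reduced against the pivots found so far, and a bitmask records which messages were XORed together to produce each pivot.

```python
import hashlib

N = 160  # SHA-1 output size in bits

def h(msg):
    # SHA-1 digest viewed as a 160-bit integer (a vector over GF(2)).
    return int.from_bytes(hashlib.sha1(msg).digest(), "big")

def build_basis():
    # pivots[b] = (vec, mask): vec has its highest set bit at position b
    # and equals the XOR of h(messages[i]) for every bit i set in mask.
    pivots, messages, counter = {}, [], 0
    while len(pivots) < N:
        msg = b"msg-%d" % counter
        counter += 1
        vec, mask = h(msg), 1 << len(messages)
        # Reduce the new vector against the existing pivots.
        while vec and (vec.bit_length() - 1) in pivots:
            pvec, pmask = pivots[vec.bit_length() - 1]
            vec ^= pvec
            mask ^= pmask
        if vec:  # linearly independent of what we have: keep it
            pivots[vec.bit_length() - 1] = (vec, mask)
            messages.append(msg)
    return pivots, messages

def preimage_set(target, pivots, messages):
    # Express target as an XOR of basis hashes; return the messages used.
    vec, mask = target, 0
    while vec:
        pvec, pmask = pivots[vec.bit_length() - 1]
        vec ^= pvec
        mask ^= pmask
    return [m for i, m in enumerate(messages) if (mask >> i) & 1]

def xor_aggregate(msgs):
    # The "option 2" set hash: XOR of the per-element hashes.
    acc = 0
    for m in msgs:
        acc ^= h(m)
    return acc

pivots, messages = build_basis()
target = h(b"some victim set hash")  # any 160-bit value works as a target
forged = preimage_set(target, pivots, messages)
print(len(messages), xor_aggregate(forged) == target)
```

Building the basis takes only slightly more than 160 hash evaluations on average, and each target is then solved with at most 160 XOR steps, which is why the attack cost is negligible.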