Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Commutative, accumulator-based function for calculating a digest of multiple hashes

Tags:

hash

I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.

The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory. I.E. the hash should be the same if two directories contain the same files, but organized with a different sub-directories structure.

The only two methods I can think of are:

  1. Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.

  2. Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. It can't even tell the difference between a directory which contains 1 instance of a file, and a directory which contains three instances of the same file.

It would be nice if there is a function which can be used similar to the way XOR is used in method #2, but is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring-all-the-options/intellectual-curiosity/future-applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).

Thanks.

like image 374
jthg Avatar asked Jul 06 '10 23:07

jthg


1 Answers

Order independent hashing of collections of hashes (is essentially what you're looking for, non?)

It sounds like any order independent operation (like addition or multiplication) will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall if multiplication will work as well.

In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.

like image 177
Slartibartfast Avatar answered Oct 09 '22 03:10

Slartibartfast