
Combine MD5 hashes of multiple files

Tags: c#, hash, md5

I have 7 files that I'm generating MD5 hashes for. The hashes are used to ensure that a remote copy of the data store is identical to the local copy. Unfortunately, the link between these two copies of the data is mind-numbingly slow. Changes to the data are very rare, but I have a requirement that the data be synchronized at all times (or as soon as possible).

Rather than passing 7 different MD5 hashes across my (extremely slow) communications link, I'd like to generate the hash for each file and then combine these hashes into a single hash, which I can transfer and then recalculate on the remote side for comparison. If the "combined hash" differs, I'd start sending the 7 individual hashes to determine exactly which file(s) have changed.

For example, here are the MD5 hashes for the 7 files as of last week:

0709d609d69385255c496436eb50402c
709465a74411bd596595c7b9b158ae6a
4ab657320ef33e3d5eb498e4c13d41b7
3b49c6ab199994fd776bb63761414e72
0fc28c5a010fc3c06c0c930c88e31a15
c4ecd214662cac5aae0e53f6f252bf0e
8b086431e43148a2c2d943ba30d31cc6

I'd like to combine these hashes such that I get a single unique value (perhaps another MD5 hash?) that I can then send to the remote system. On the remote system, I'd perform the same calculation to determine whether the data as a whole has changed. If it has, I'd start sending the individual hashes, etc. The most important factor is that my "combined hash" be short enough to use less bandwidth than just sending all 7 hashes in the first place. I thought of writing the 7 MD5 hashes to a file and then hashing that file, but is there a better way?

bmt22033 asked Dec 03 '12 04:12


4 Answers

Why don't you:

  • Generate the 7 MD5 hashes (which is what you are doing now), and then
  • Combine these 7 hash outputs into a larger byte array and MD5 hash that to produce an overall hash. (Each MD5 hash is 16 bytes, so you will end up with a 112-byte array, which you then hash to get the overall hash.)

If your overall hash matches the one computed at the other end, then nothing needs to be done. If not, you start sending your 7 intermediate hashes to work out which file(s) have changed.
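The hash-of-hashes step can be sketched as follows. This is a minimal illustration in Python rather than C# (chosen only for brevity; the same steps should map onto the `System.Security.Cryptography.MD5` class). The function name `combined_md5` is made up for this example, and it assumes both sides agree on the order of the 7 files.

```python
import hashlib

# The seven per-file MD5 digests from the question, as hex strings.
file_hashes = [
    "0709d609d69385255c496436eb50402c",
    "709465a74411bd596595c7b9b158ae6a",
    "4ab657320ef33e3d5eb498e4c13d41b7",
    "3b49c6ab199994fd776bb63761414e72",
    "0fc28c5a010fc3c06c0c930c88e31a15",
    "c4ecd214662cac5aae0e53f6f252bf0e",
    "8b086431e43148a2c2d943ba30d31cc6",
]

def combined_md5(hex_digests):
    """Concatenate the raw 16-byte digests in order and MD5 the result.

    Hypothetical helper: both ends must pass the digests in the same order.
    """
    raw = b"".join(bytes.fromhex(h) for h in hex_digests)  # 7 * 16 = 112 bytes
    return hashlib.md5(raw).hexdigest()

overall = combined_md5(file_hashes)
print(overall)  # one 16-byte hash (32 hex characters) to send over the link
```

Because the digests are concatenated in order, swapping two files changes the overall hash as well, so the file order is effectively part of what is being compared.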

BlokeTech answered Sep 18 '22 17:09


You could just calculate a hash of the contents of all seven files concatenated together.

However, I don't recommend that, because you will open yourself up to subtle bugs, like:

file1: 01 02 03 04
file2: 05 06 07 08

will hash the same as

file1: 01 02
file2: 03 04 05 06 07 08
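The boundary problem is easy to reproduce. The following Python sketch (Python is used only for brevity; the variable and function names are made up for this example) hashes both layouts of the same eight bytes, first as one concatenated stream and then as a hash of the per-file hashes.

```python
import hashlib

# Two different ways of splitting the same eight bytes across two files.
file1_a, file2_a = bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8])
file1_b, file2_b = bytes([1, 2]), bytes([3, 4, 5, 6, 7, 8])

def hash_concatenated(*files):
    """MD5 of the file contents joined end to end (file boundaries are lost)."""
    return hashlib.md5(b"".join(files)).hexdigest()

def hash_of_hashes(*files):
    """MD5 of the concatenated per-file MD5 digests (boundaries are preserved)."""
    digests = b"".join(hashlib.md5(f).digest() for f in files)
    return hashlib.md5(digests).hexdigest()

# The concatenated stream is identical in both cases, so the hashes match.
same = hash_concatenated(file1_a, file2_a) == hash_concatenated(file1_b, file2_b)

# Hashing each file first keeps the two layouts apart.
differ = hash_of_hashes(file1_a, file2_a) != hash_of_hashes(file1_b, file2_b)

print(same, differ)
```

Hashing each file separately before combining avoids the ambiguity, because the per-file digests already depend on where each file starts and ends.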

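How slow is your comm link? A single MD5 hash is 16 bytes (32 characters when written as hex).

Seven of them come to 112 bytes raw, or 224 bytes as hex text; either way that is less than 1/4 KB, which is just not much data.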

On which side of the link are the files going to change?

You could cache a set of MD5s on that side, compare the files to the cached hashes on a regular basis, and kick off a transfer when you notice a difference.
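A rough sketch of that caching idea, again in Python for brevity. The names `md5_of_file` and `changed_files` and the use of a plain dictionary as the cache are assumptions made for this example; how the check is scheduled and how the cache is persisted are left out.

```python
import hashlib

def md5_of_file(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(paths, cache):
    """Return the paths whose MD5 differs from the cached value.

    `cache` is a dict mapping path -> last seen hex digest (hypothetical
    layout). It is updated in place, so a file is reported only once per change.
    """
    changed = []
    for p in paths:
        current = md5_of_file(p)
        if cache.get(str(p)) != current:
            changed.append(p)
            cache[str(p)] = current
    return changed
```

With an empty cache, the first call reports every file as changed; after that, only files whose contents actually differ trigger a transfer, and nothing needs to cross the link while the data is unchanged.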

Marshall Clow answered Sep 19 '22 17:09


XOR them all.

As far as I know, it's the simplest and most effective solution.
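For illustration, XOR-ing the hex digests could look like the Python sketch below (the helper name `xor_hashes` is made up for this example). Two properties of XOR are worth keeping in mind when choosing it: the result does not depend on the order of the hashes, and two identical hashes cancel each other out.

```python
def xor_hashes(hex_digests):
    """XOR a list of 128-bit MD5 hex digests into one 32-character hex string."""
    combined = 0
    for h in hex_digests:
        combined ^= int(h, 16)
    # Pad to 32 hex characters so leading zeros are not dropped.
    return format(combined, "032x")

a = "0709d609d69385255c496436eb50402c"
b = "709465a74411bd596595c7b9b158ae6a"

print(xor_hashes([a, b]) == xor_hashes([b, a]))  # order does not matter
print(xor_hashes([a, a]))                        # identical hashes cancel to all zeros
```

If swapped files or duplicate file contents need to be detected, hashing the concatenated digests instead of XOR-ing them may be the safer choice.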

mixel answered Sep 17 '22 17:09


Another option is to generate a single hash in the first place - see https://stackoverflow.com/a/15683147/188926

That example iterates over all files in a folder, but you could iterate over your list of files instead.
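One way to produce a single hash over a list of files is sketched below in Python. This is not taken from the linked answer; the function name `single_md5` and the 8-byte length prefix written before each file are assumptions made for this example, the prefix being there so that moving bytes from one file to the next changes the result.

```python
import hashlib
import os

def single_md5(paths):
    """Feed several files into one MD5 object and return a single hex digest.

    Hypothetical helper: each file is preceded by its size as an 8-byte
    big-endian integer so that file boundaries affect the hash.
    """
    h = hashlib.md5()
    for p in paths:
        h.update(os.path.getsize(p).to_bytes(8, "big"))
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
    return h.hexdigest()
```

Both sides would need to process the files in the same order. Note that a single hash like this only says whether something changed; working out which file changed still needs the per-file hashes.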

Dunc answered Sep 19 '22 17:09