I have 7 files that I'm generating MD5 hashes for. The hashes are used to ensure that a remote copy of the data store is identical to the local copy. Unfortunately, the link between these two copies of the data is mind-numbingly slow. Changes to the data are very rare, but I have a requirement that the data be synchronized at all times (or as soon as possible).

Rather than passing 7 different MD5 hashes across my (extremely slow) communications link, I'd like to generate the hash for each file and then combine these hashes into a single hash which I can transfer and then re-calculate/use for comparison on the remote side. If the "combined hash" differs, then I'd start sending the 7 individual hashes to determine exactly which file(s) have changed.

For example, here are the MD5 hashes for the 7 files as of last week:
0709d609d69385255c496436eb50402c
709465a74411bd596595c7b9b158ae6a
4ab657320ef33e3d5eb498e4c13d41b7
3b49c6ab199994fd776bb63761414e72
0fc28c5a010fc3c06c0c930c88e31a15
c4ecd214662cac5aae0e53f6f252bf0e
8b086431e43148a2c2d943ba30d31cc6
I'd like to combine these hashes together such that I get a single unique value (perhaps another MD5 hash?) that I can then send to the remote system. On the remote system, I'd perform the same calculation to determine whether the data as a whole has changed. If it has, then I'd start sending the individual hashes, etc.

The most important factor is that my "combined hash" be short enough that it uses less bandwidth than just sending all 7 hashes in the first place. I thought of writing the 7 MD5 hashes to a file and then hashing that file, but is there a better way?
Why don't you:

- generate an MD5 hash for each of the 7 files,
- concatenate those hashes into a single string, and
- take the MD5 hash of that string as your overall hash?

If your overall hash matches the one on the other end, then nothing needs to be done. If not, then you start sending over your 7 intermediate hashes to work out which file(s) have changed.
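A minimal sketch of this hash-of-hashes idea, assuming you already have the per-file hex digests (the 7 values are taken from the question); both ends must use the same file order:

```python
import hashlib

def combine_hashes(hex_digests):
    """MD5 over the concatenation of the individual hex digests.
    Order matters, so both sides must agree on the file order."""
    outer = hashlib.md5()
    for h in hex_digests:
        outer.update(h.encode("ascii"))
    return outer.hexdigest()

# The 7 per-file hashes from the question:
hashes = [
    "0709d609d69385255c496436eb50402c",
    "709465a74411bd596595c7b9b158ae6a",
    "4ab657320ef33e3d5eb498e4c13d41b7",
    "3b49c6ab199994fd776bb63761414e72",
    "0fc28c5a010fc3c06c0c930c88e31a15",
    "c4ecd214662cac5aae0e53f6f252bf0e",
    "8b086431e43148a2c2d943ba30d31cc6",
]
print(combine_hashes(hashes))  # one 32-character hash to send
```

Only this one 32-character result crosses the link; the 7 individual hashes follow only on a mismatch.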
You could just calculate a hash of the contents of all seven files concatenated together.
However, I don't recommend that, because you will open yourself up to subtle bugs, like:
file1: 01 02 03 04
file2: 05 06 07 08
will hash the same as
file1: 01 02
file2: 03 04 05 06 07 08
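The bug in concrete terms: hashing the raw concatenation cannot tell where one file ends and the next begins, so both layouts above produce the same digest.

```python
import hashlib

# Same eight bytes, split across the two "files" differently:
a = hashlib.md5(bytes([0x01, 0x02, 0x03, 0x04])
                + bytes([0x05, 0x06, 0x07, 0x08])).hexdigest()
b = hashlib.md5(bytes([0x01, 0x02])
                + bytes([0x03, 0x04, 0x05, 0x06, 0x07, 0x08])).hexdigest()
print(a == b)  # True: two different file layouts, one identical hash
```

Hashing the per-file hashes (rather than the raw contents) avoids this, because each file's boundary is fixed by its own digest.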
How slow is your comm link? A single MD5 hash is 16 bytes (32 hex characters), so 7 of them come to less than a quarter of a kilobyte even as hex strings; that's just not much data.
On what side of the link are the files going to change?
You could cache a set of MD5s on that side, compare the files to the cached hashes on a regular basis, and kick off a transfer when you notice a difference.
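A sketch of that caching approach; the cache file name and the `changed_files` helper are my own invention for illustration:

```python
import hashlib
import json

CACHE = "md5_cache.json"  # hypothetical cache file name

def md5_of(path):
    """MD5 of one file, read in chunks to keep memory flat."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(paths, cache=CACHE):
    """Return the paths whose hash differs from the cached value,
    then refresh the cache for the next run."""
    try:
        with open(cache) as f:
            cached = json.load(f)
    except FileNotFoundError:
        cached = {}  # first run: everything counts as changed
    current = {p: md5_of(p) for p in paths}
    with open(cache, "w") as f:
        json.dump(current, f)
    return [p for p in paths if cached.get(p) != current[p]]
```

Run it from a scheduled job on the side where the files change; only when it returns a non-empty list do you need to touch the slow link at all.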
XOR them all together. As far as I know, it's the simplest and most effective solution.
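A sketch of the XOR approach, combining the 16-byte digests byte-wise:

```python
def xor_hashes(hex_digests):
    """XOR all the digests together byte-wise."""
    acc = bytes(16)  # an MD5 digest is 16 bytes
    for h in hex_digests:
        acc = bytes(x ^ y for x, y in zip(acc, bytes.fromhex(h)))
    return acc.hex()
```

One caveat worth knowing: XOR is order-independent and two identical digests cancel to zero, so some changes (e.g. two files swapping contents, or the same change appearing in two files) leave the combined value unchanged.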
Another option is to generate a single hash in the first place - see https://stackoverflow.com/a/15683147/188926
This example iterates all files in a folder, but you could iterate over your list of files instead.
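A sketch of that single-hash idea adapted to a list of files; the `hash_file_list` name is my own, and mixing each file's name into the digest is one way to keep the boundaries between files unambiguous (use relative names so both ends hash the same strings):

```python
import hashlib

def hash_file_list(paths):
    """One MD5 over all the files, fed name-then-contents in a fixed order."""
    h = hashlib.md5()
    for path in sorted(paths):  # fixed order so both ends agree
        h.update(path.encode("utf-8"))
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
    return h.hexdigest()
```

Note this trades away the two-stage scheme: there are no per-file hashes to fall back on, so a mismatch tells you only that *something* changed.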