
quickly (XOR-?) combine SHA1 hashes to generate a new hash

Having a (possibly large) list of unique text lines (stringified JSON data), I need to calculate a unique hash for the whole text document. New lines are often appended to the document, and occasionally some lines are deleted from it, resulting in a completely new hash for the document.

The ultimate goal is to be able to identify identical documents just using the hash.

Of course, calculating the SHA1 hash for the whole document after each modification would give me the desired unique hash, but it would also be computationally expensive - especially when just ~40 bytes are appended to a 5 megabyte document and all of that data has to go through the SHA1 calculation again.

So, I'm looking for a solution that reduces the time it takes to calculate the new hash.

A summary of the problem properties / requirements:

  • each line is guaranteed to be unique
  • the order of the lines does not need to matter (even better if it doesn't)
  • the length of a single line is usually small, but the whole document might be large
  • the algorithm can be optimized for appended data (i.e. deleting data may even require restarting the hash calculation from scratch)

My current idea is to calculate the SHA1 (or whatever) hash for each single line individually and then XOR the hashes together. That should satisfy all requirements. For new lines I just calculate the SHA1 of that line and XOR it with the already known sum.
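
A minimal Python sketch of that idea (the sample lines are arbitrary placeholders):

    import hashlib

    def line_hash(line: str) -> int:
        """SHA-1 of a single line, as an integer so it can be XORed."""
        return int.from_bytes(hashlib.sha1(line.encode("utf-8")).digest(), "big")

    def document_hash(lines) -> int:
        """Order-independent combined hash: XOR of all per-line hashes."""
        combined = 0
        for line in lines:
            combined ^= line_hash(line)
        return combined

    # Appending (or removing) a line is a single XOR, since x ^ h ^ h == x.
    doc = ['{"a": 1}', '{"b": 2}']
    h = document_hash(doc)
    h ^= line_hash('{"c": 3}')                     # append a line
    assert h == document_hash(doc + ['{"c": 3}'])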

However, I'm in doubt because...

  • I'm not sure whether the XORed hash would still be strong enough to uniquely identify a document (i.e. is there a significantly higher probability of unwanted collisions?)
  • calculating lots of SHA1 hashes of short lines might be computationally expensive on its own (at least during initialization)

Can anybody shed some light on these issues?

Alternatively, is it generally possible with SHA1 (or a similar hash) to quickly generate a new hash for appended data (old hash + appended data = new hash)?
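
For pure appends this is in fact possible with streaming hash APIs such as Python's hashlib, as long as the live hash object (not just its final digest) is kept around; deletions would still force a full rehash. A minimal sketch:

    import hashlib

    # Persist the live hash object (its internal state), not just the digest.
    doc_hasher = hashlib.sha1()
    doc_hasher.update(b'{"a": 1}\n{"b": 2}\n')   # initial document

    snapshot = doc_hasher.copy()                 # cheap copy of the state
    print(snapshot.hexdigest())                  # hash of the document so far

    doc_hasher.update(b'{"c": 3}\n')             # only the ~40 appended bytes
    print(doc_hasher.hexdigest())                # new hash, no full rehash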

Udo G asked Oct 30 '22

1 Answer

There are problems with hashing each line individually and XORing the results.

If two identical lines are added, the combined XOR will not change, because a hash XORed with itself cancels out.
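
A quick demonstration, using an arbitrary stand-in for the existing combined value:

    import hashlib

    h = int.from_bytes(hashlib.sha1(b'{"a": 1}').digest(), "big")
    combined = 0x1234                # stand-in for an existing combined hash
    # XORing the same line hash in twice cancels out (h ^ h == 0):
    assert (combined ^ h) ^ h == combined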

You might be better off hashing the concatenation of all the individual line hashes.

Perhaps use a Merkle Tree.
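
A minimal Python sketch of the hash-of-hashes suggestion, assuming line order should still not matter (sorting the digests makes the result order-independent, and duplicate lines now change the result instead of cancelling out):

    import hashlib

    def document_hash(lines) -> str:
        """Hash of the sorted per-line digests."""
        digests = sorted(hashlib.sha1(line.encode("utf-8")).digest() for line in lines)
        outer = hashlib.sha1()
        for d in digests:
            outer.update(d)
        return outer.hexdigest()

    print(document_hash(['{"a": 1}', '{"b": 2}']))

Note that keeping the digests sorted makes an append more expensive than a single XOR; a Merkle tree recovers cheap incremental updates.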

zaph answered Nov 12 '22