I have a (possibly large) list of unique text lines (stringified JSON data) and need to calculate a unique hash for the whole text document. New lines are often appended to the document and occasionally some lines are deleted from it, resulting in a completely new hash for the document.
The ultimate goal is to be able to identify identical documents just using the hash.
Of course, calculating the SHA1 hash for the whole document after each modification would give me the desired unique hash, but it would also be computationally expensive - especially when just ~40 bytes are appended to a 5 megabyte document and all 5 MB have to go through the SHA1 calculation again.
So, I'm looking into a solution that allows me to reduce the time it takes to calculate the new hash.
To summarize the problem properties / requirements: the hash must identify identical documents, and appending or deleting a few lines should not require rehashing the entire document.
My current idea is to calculate the SHA1 (or whatever) hash for each line individually and then XOR the hashes together. That should satisfy all requirements. For a new line I just calculate its SHA1 and XOR it with the already known combined hash.
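To make the idea concrete, here is a minimal Python sketch of that scheme (the names `line_hash` and `XorDocumentHash` are made up for illustration):

```python
import hashlib

def line_hash(line: str) -> int:
    """SHA-1 of a single line, interpreted as an integer so it can be XOR-ed."""
    return int.from_bytes(hashlib.sha1(line.encode("utf-8")).digest(), "big")

class XorDocumentHash:
    """Maintains the XOR of all per-line SHA-1 hashes.

    Appending or deleting a line costs only the hash of that one line,
    independent of the total document size.
    """
    def __init__(self, lines=()):
        self.value = 0
        for line in lines:
            self.add(line)

    def add(self, line: str) -> None:
        self.value ^= line_hash(line)

    def remove(self, line: str) -> None:
        # XOR is its own inverse, so removing a line is the same operation as adding it.
        self.value ^= line_hash(line)

    def hexdigest(self) -> str:
        return format(self.value, "040x")
```

Note that `remove` simply assumes the deleted line was actually present; XOR alone cannot detect a mismatch.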
However, I have some doubts about this approach. Can anybody shed some light on it?
Alternatively, is it generally possible with SHA1 (or a similar hash) to quickly derive a new hash for appended data (old hash + appended data = new hash)?
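For the append-only part this is indeed possible, but only if you keep the live hash object (i.e. its internal state) around rather than just the final digest, and it does not help with deletions. A hedged sketch using Python's hashlib:

```python
import hashlib

# Keep the live hash object instead of only its hex digest.
doc_hash = hashlib.sha1()
doc_hash.update(b'{"id": 1}\n')
doc_hash.update(b'{"id": 2}\n')

snapshot = doc_hash.copy()        # cheap copy of the internal SHA-1 state
print(snapshot.hexdigest())       # hash of the document so far

doc_hash.update(b'{"id": 3}\n')   # appending ~40 bytes hashes only those bytes
print(doc_hash.hexdigest())       # new hash of the larger document
```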
There is a problem with hashing each line individually and XOR-ing the results: if two identical lines are added, their hashes cancel out under XOR, so the combined hash does not change. You might be better off hashing all the individual line hashes together, or perhaps using a Merkle tree.
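A rough Python sketch of the Merkle tree idea, built over the per-line SHA-1 hashes (the helper names are hypothetical). Duplicate lines no longer cancel out, and when a single line changes only the hashes along its path to the root need to be recomputed:

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(lines):
    """Merkle root over the per-line SHA-1 hashes.

    A simpler (non-tree) variant would be to hash the concatenation of
    all line hashes: sha1(b"".join(level)).
    """
    level = [sha1(line.encode("utf-8")) for line in lines]
    if not level:
        return sha1(b"")  # conventional value for an empty document
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```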