I'm on a personal quest to learn how the rsync algorithm works. After some reading and thinking, I've come up with a situation where I think the algorithm fails. I'm trying to figure out how this is resolved in an actual implementation.
Consider this example, where A is the receiver and B is the sender.
A = abcde1234512345fghij
B = abcde12345fghij
As you can see, the only change is that 12345 has been removed.
Now, to make this example interesting, let's choose a block size of 5 bytes (chars). Hashing the blocks on the sender's side with the weak checksum gives the following list of values.
abcde|12345|fghij
abcde -> 495
12345 -> 255
fghij -> 520
values = [495, 255, 520]
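(These values come out of a plain byte sum, which I'm using as a handy stand-in for the real weak checksum; rsync's actual weak checksum is Adler-32-like, with two running sums so it can be rolled forward one byte at a time. A quick sketch in Python that reproduces them:)

    def weak_checksum(block: bytes) -> int:
        # Plain byte sum; reproduces the values above (495, 255, 520).
        # rsync's real weak checksum is Adler-32-style so it can roll.
        return sum(block)

    B = b"abcde12345fghij"
    BLOCK_SIZE = 5
    values = [weak_checksum(B[i:i + BLOCK_SIZE])
              for i in range(0, len(B), BLOCK_SIZE)]
    print(values)  # [495, 255, 520]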
Next we scan A, checking each block's hash against the values list. If the block's hash is in the list, we can skip to the end of that block for the next check; if it isn't, we've found a difference. I'll step through this process.
abcde -> 495 (yes, so skip)
12345 -> 255 (yes, so skip)
12345 -> 255 (yes, so skip)
fghij -> 520 (yes, so skip)
Since every hash was found in the values list, we conclude that A and B are the same. Which, in my humble opinion, isn't true.
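A quick sketch of that scan, exactly as I described it (block-aligned skips after each match, using the byte-sum checksum from above):

    A = b"abcde1234512345fghij"
    values = [495, 255, 520]
    BLOCK_SIZE = 5

    i = 0
    matched = True
    while i + BLOCK_SIZE <= len(A):
        if sum(A[i:i + BLOCK_SIZE]) in values:
            i += BLOCK_SIZE  # matching block: skip to its end
        else:
            matched = False  # non-matching block: found a difference
            break
    print(matched)  # True -- every block of A matches, yet A != B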
It seems to me this will happen whenever more than one block shares the same hash. I know I have skipped the step of calculating and checking the strong hash, but that won't make a difference, since the second and third blocks are exactly the same.
What am I missing?
The rsync algorithm sends two checksums for each block: a weak "rolling" checksum and a strong one. More importantly, the output of the matching scan is not a same/different verdict. It is an ordered list of instructions, either "copy block N" or "insert these literal bytes", from which the file is reconstructed. In your example the doubled-up block simply produces two references to the same block, so the reconstruction comes out right, and a whole-file strong checksum at the end catches anything that slips through.
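To make that concrete, here is a minimal end-to-end sketch (my own simplified token format, not rsync's actual wire protocol), following the question's setup of hashing B's blocks and scanning A. The scan's output is a recipe, not a verdict: the doubled-up block shows up as two references to the same block, and applying the recipe reproduces A exactly.

    def weak(block: bytes) -> int:
        # Byte-sum stand-in for rsync's rolling checksum.
        return sum(block)

    def make_delta(basis: bytes, target: bytes, size: int):
        # Index the basis file's blocks by weak hash (real rsync also
        # keeps a strong hash per block and verifies it on weak matches).
        table = {}
        for n, i in enumerate(range(0, len(basis), size)):
            table.setdefault(weak(basis[i:i + size]), n)
        # Scan the target, emitting block references or literal bytes.
        # (rsync rolls the checksum forward one byte at a time instead
        # of recomputing it for every window, as done here.)
        delta, i = [], 0
        while i < len(target):
            window = target[i:i + size]
            if len(window) == size and weak(window) in table:
                delta.append(("ref", table[weak(window)]))  # copy block N
                i += size
            else:
                delta.append(("lit", target[i:i + 1]))  # literal byte
                i += 1
        return delta

    def apply_delta(basis: bytes, delta, size: int) -> bytes:
        out = b""
        for kind, val in delta:
            out += basis[val * size:(val + 1) * size] if kind == "ref" else val
        return out

    A = b"abcde1234512345fghij"
    B = b"abcde12345fghij"
    delta = make_delta(B, A, 5)
    print(delta)                          # [('ref', 0), ('ref', 1), ('ref', 1), ('ref', 2)]
    print(apply_delta(B, delta, 5) == A)  # True

Note that block 1 is referenced twice, so nothing is lost: "all blocks matched" never collapses into "the files are identical".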