I have to store two files, A and B, both of which are very large (around 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two interesting aspects to this problem.
I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?
Again, the problem is simple: write an algorithm that can store the files A and B with as few bytes as possible, given that the two are quite similar.
Additional info: although big parts might be identical, they are likely to appear at different offsets and out of order. This last fact is why a conventional diff might not save much.
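For a sense of why offset-independent matching is possible, here is a minimal Python sketch of the rolling-checksum idea that rsync-family tools are built on. The block size, the Adler-style weak checksum, and the fully in-memory handling are simplifying assumptions; a real implementation streams the files and tunes these choices.

```python
import hashlib

BLOCK = 4096  # illustrative block size; real tools tune this


def weak(buf):
    """Adler-style weak checksum, kept as two rollable components (a, b)."""
    a = sum(buf) % 65536
    b = sum((len(buf) - i) * x for i, x in enumerate(buf)) % 65536
    return a, b


def signature(a_bytes):
    """Index every BLOCK-sized block of A by weak checksum and SHA-256."""
    sig = {}
    for off in range(0, len(a_bytes) - BLOCK + 1, BLOCK):
        blk = a_bytes[off:off + BLOCK]
        sig.setdefault(weak(blk), []).append((hashlib.sha256(blk).digest(), off))
    return sig


def delta(sig, b_bytes):
    """Scan B byte by byte, emitting ('copy', offset-in-A) or ('lit', bytes) ops."""
    ops, lit, i, n = [], bytearray(), 0, len(b_bytes)
    a = b = None
    while i + BLOCK <= n:
        if a is None:
            a, b = weak(b_bytes[i:i + BLOCK])
        blk = b_bytes[i:i + BLOCK]
        hit = next((off for s, off in sig.get((a, b), ())
                    if s == hashlib.sha256(blk).digest()), None)
        if hit is not None:
            if lit:
                ops.append(('lit', bytes(lit)))
                lit = bytearray()
            ops.append(('copy', hit))
            i += BLOCK
            a = None  # recompute the window at the new position
        else:
            lit.append(b_bytes[i])
            if i + BLOCK < n:  # roll the window forward by one byte
                out, inc = b_bytes[i], b_bytes[i + BLOCK]
                a = (a - out + inc) % 65536
                b = (b - BLOCK * out + a) % 65536
            i += 1
    lit.extend(b_bytes[i:])  # unmatched tail of B
    if lit:
        ops.append(('lit', bytes(lit)))
    return ops


def patch(a_bytes, ops):
    """Rebuild B from A plus the delta operations."""
    out = bytearray()
    for kind, val in ops:
        out += a_bytes[val:val + BLOCK] if kind == 'copy' else val
    return bytes(out)
```

With this scheme, B is reconstructed as patch(A, delta(signature(A), B)), and identical blocks are found wherever they occur in A, regardless of offset or order.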
Myers: the Myers algorithm (Myers, 1986) is the default diff algorithm in the git diff command. It greedily follows runs of identical elements between the two sequences to find the shortest edit script.
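For reference, here is a compact Python sketch of the greedy forward pass of Myers's algorithm. It returns only the length of the shortest edit script; recovering the script itself requires keeping the frontier arrays for each step and backtracking. Note that an edit script is ordered, so Myers alone cannot exploit blocks that have moved out of order.

```python
def myers_distance(a, b):
    """Length of the shortest edit script between sequences a and b (Myers, 1986)."""
    n, m = len(a), len(b)
    v = {1: 0}  # v[k] = furthest x reached on diagonal k, where k = x - y
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # step down (insertion) or right (deletion), whichever reaches further
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1  # follow the diagonal "snake" of matches
            v[k] = x
            if x >= n and y >= m:
                return d  # d edits suffice, and no fewer do
    return n + m
```

For example, myers_distance("ABCABBA", "CBABAC") returns 5, the worked example from the original paper.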
You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:
Create a signature of one file, e.g.

```
rdiff signature A sig.txt
```

Using the generated signature file sig.txt and the other big file, create the delta:

```
rdiff delta sig.txt B delta
```

Now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run:

```
rdiff patch A delta B
```
In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast; I get about 40 MB per second on my PC. I have just tried it on an 8 GB file, and the memory used by rdiff was about 1 MB.
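To script the whole pipeline, here is a small Python wrapper around the three rdiff steps shown above, ending with a byte-for-byte check that B was reproduced. It assumes the rdiff binary from librsync is on the PATH; the file names are the placeholders used above.

```python
import filecmp
import subprocess


def rdiff_roundtrip(a_path: str, b_path: str) -> None:
    """Signature -> delta -> patch, then verify the reconstruction of B."""
    # 1. Signature of the basis file A
    subprocess.run(["rdiff", "signature", a_path, "sig.txt"], check=True)
    # 2. Delta computed from the signature and the new file B
    subprocess.run(["rdiff", "delta", "sig.txt", b_path, "delta"], check=True)
    # 3. Recreate B from A and the delta
    subprocess.run(["rdiff", "patch", a_path, "delta", "B.restored"], check=True)
    # Byte-for-byte comparison of the original and restored B
    assert filecmp.cmp(b_path, "B.restored", shallow=False), "round trip failed"


rdiff_roundtrip("A", "B")
```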