Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for efficient diffing of huge files

I have to store two files A and B which are both very large (like 100GB). However B is likely to be similar in big parts to A so i could store A and diff(A, B). There are two interesting aspects to this problem:

  1. The files are too big to be analyzed by any diff library I know of because they are in-memory
  2. I don't actually need a diff - a diff typically has inserts, edits and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from old file from arbitrary offset".

I am currently at a loss at how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: Write an algorithm that can store the files A and B with as few bytes as possible given the fact that both are quite similar.

Additional info: Although big parts might be identical they are likely to have different offsets and be out of order. The last fact is why a conventional diff might not save much.

like image 926
usr Avatar asked Jan 08 '10 19:01

usr


People also ask

What is a Diffing algorithm?

Diffing is a heuristic algorithm based on two assumptions: Two elements of different types will produce different trees. The developer can hint at what elements will remain stable across renders with a key prop.

What algorithm does git diff use?

Myers. Myers algorithm was developed by Myers (1986). In the git diff command, this algorithm is used as the default. The operation of this algorithm traces the two primary identical sequences recursively with the least edited script.


1 Answers

You can use rdiff, which works very well with large files. Here I create a diff of two big files A and B:

  1. Create a signature of one file, with e.g.

    rdiff signature A sig.txt
    
  2. using the generated signature file sig.txt and the other big file, create the delta:

    rdiff delta sig.txt B delta
    
  3. now delta contains all the information you need to recreate file B when you have both A and delta. To recreate B, run

    rdiff patch A delta B
    

In Ubuntu, just run sudo apt-get install rdiff to install it. It is quite fast, I get about 40 MB per second on my PC. I have just tried it on a 8GB file, and the memory used by rsync was about 1MB.

like image 60
martinus Avatar answered Sep 22 '22 14:09

martinus