Algorithm for efficient diffing of huge files

Tags: algorithm, diff, rcs

I have to store two files, A and B, which are both very large (around 100 GB). However, B is likely to be similar to A in large parts, so I could store A and diff(A, B). There are two interesting aspects to this problem:

1. The files are too big to be analyzed by any diff library I know of, because those libraries work in memory.
2. I don't actually need a diff. A diff typically has inserts, edits and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from old file from arbitrary offset".
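
To make the two operations concrete, here is a minimal Python sketch of how such a delta could be applied. The op names `"copy"` and `"data"`, the tuple encoding, and `apply_delta` are hypothetical, chosen only for illustration; this is not the format of any existing tool. The old file is read with `seek`, so copy ranges may appear in any order and only one chunk is held in memory at a time.

```python
import io
from typing import BinaryIO, Iterable, Tuple, Union

# Hypothetical op encoding (an assumption, not any real tool's format):
#   ("copy", offset, length): take `length` bytes of A starting at `offset`
#   ("data", payload):        append `payload`, bytes that appear only in B
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def apply_delta(old: BinaryIO, ops: Iterable[Op], out: BinaryIO,
                chunk_size: int = 1 << 20) -> None:
    """Rebuild B from A plus a list of ops, streaming chunk by chunk."""
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            old.seek(offset)  # arbitrary offsets: ranges may be out of order
            remaining = length
            while remaining > 0:
                chunk = old.read(min(chunk_size, remaining))
                if not chunk:
                    raise ValueError("copy range runs past the end of the old file")
                out.write(chunk)
                remaining -= len(chunk)
        elif op[0] == "data":
            out.write(op[1])
        else:
            raise ValueError(f"unknown op {op[0]!r}")


# Usage: B reuses two ranges of A out of order, plus a few new bytes.
a = io.BytesIO(b"HEADER|payload-one|payload-two|")
ops = [("copy", 19, 12), ("data", b"NEW|"), ("copy", 0, 7)]
b = io.BytesIO()
apply_delta(a, ops, b)
```

With real files, `old` and `out` would be opened in binary mode (`open(path, "rb")` and `open(path, "wb")`), and large `"data"` payloads would need to be streamed as well.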

I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: Write an algorithm that can store the files A and B with as few bytes as possible given the fact that both are quite similar.

Additional info: Although big parts might be identical, they are likely to sit at different offsets and to be out of order. That last point is why a conventional diff might not save much.


asked Jan 08 '10 19:01

usr

1 Answer

You can use `rdiff`, which works very well with large files. Here is how to create a delta between two big files A and B:

1. Create a signature of one file, e.g.
```
rdiff signature A sig.txt
```
2. Using the generated signature file `sig.txt` and the other big file, create the delta:
```
rdiff delta sig.txt B delta
```
3. Now `delta` contains all the information you need to recreate file B when you have both A and `delta`. To recreate B, run
```
rdiff patch A delta B
```

On Ubuntu, just run `sudo apt-get install rdiff` to install it. It is quite fast: I get about 40 MB per second on my PC. I have just tried it on an 8 GB file, and the memory used by rdiff was about 1 MB.
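
For readers who want the idea behind the signature and delta steps, here is a simplified Python sketch of rsync-style block matching. It is not rdiff's actual implementation or file format. The function names, the `("copy", offset, length)` and `("data", bytes)` tuples, and the use of SHA-256 as the strong hash are all assumptions made for illustration. It works on in-memory `bytes` for brevity; a real tool would stream both files and keep only the signature in memory.

```python
import hashlib
from typing import Dict, List, Tuple

MOD = 1 << 16


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """rsync-style weak checksum: cheap to compute and cheap to roll."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def make_signature(old: bytes, block_size: int
                   ) -> Dict[Tuple[int, int], List[Tuple[bytes, int]]]:
    """Map weak checksum -> [(strong hash, offset)] for each full block of A."""
    sig: Dict[Tuple[int, int], List[Tuple[bytes, int]]] = {}
    for off in range(0, len(old) - block_size + 1, block_size):
        block = old[off:off + block_size]
        sig.setdefault(weak_checksum(block), []).append(
            (hashlib.sha256(block).digest(), off))
    return sig


def make_delta(sig, new: bytes, block_size: int) -> list:
    """Scan B with a sliding window; emit copy ops for blocks found in A."""
    ops, literal = [], bytearray()
    pos, n = 0, len(new)
    a = b = None  # weak checksum of the current window, None when stale
    while pos + block_size <= n:
        if a is None:
            a, b = weak_checksum(new[pos:pos + block_size])
        match = None
        candidates = sig.get((a, b))
        if candidates:  # confirm the weak hit with the strong hash
            strong = hashlib.sha256(new[pos:pos + block_size]).digest()
            match = next((off for h, off in candidates if h == strong), None)
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match, block_size))
            pos += block_size
            a = None  # window jumped, so recompute from scratch
        else:
            literal.append(new[pos])
            if pos + block_size < n:  # roll the window forward by one byte
                out_b, in_b = new[pos], new[pos + block_size]
                a = (a - out_b + in_b) % MOD
                b = (b - block_size * out_b + a) % MOD
            pos += 1
    literal.extend(new[pos:])  # tail shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


# Usage: blocks of A reappear in B at different offsets and out of order.
old = b"AAAABBBBCCCCDDDD"
new = b"CCCCxxAAAADDDD"
ops = make_delta(make_signature(old, 4), new, 4)
```

Because the signature is indexed by block content rather than by position, matches are found wherever a block of A shows up in B, which is the out-of-order case the question describes.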

answered Sep 22 '22 14:09

martinus
