Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Minimal Difference Patch Algorithm

I'm trying to convey the difference between two bytestreams. I want to minimize the number of bytes in the patch.

(I don't necessarily want to minimize the number of "changes" in the diff, which is what the optimal patch in a levenshtein distance computation would give me.)

The patch would ideally be in a format such that, given the source bytestream and the diff, it would be easy to reconstruct the target bytestream.

Is there a good algorithm for doing this?


Edit: For the record, I've tried sending changes of the form "at spot 506, insert the following the bytes...", where I create a change list from the levenshtein distance algorithm.

The problem I have is that the levenshtein distance algorithm gives me a lot of changes like:

at spot 506 substitute [some bytes1]
at spot 507 do nothing
at spot 508 substitute [some bytes2]
at spot 509 do nothing
at spot 510 substitute [some bytes3]
...

This is because the lev distance algorithm tries to minimize the number of changes. However, for my purposes this instruction set is wasteful. It would probably be better if an algorithm just said,

At spot 506 substitute [some bytes1, [byte at spot 507], some bytes2, [byte at spot 509], some bytes3, ...]

There's probably some way to modify lev distance to favor these types of changes but it seems a little tricky. I could coalesce substituions after getting a changelist (and I'm going to try that) but there may be opportunities to coalesce deletions / inserts too, and it's less obvious how to do that correctly.

Just wondering if there's a special purpose algorithm for this (or if somebody's done a modification of lev distance to favor these types of changes already).

like image 586
user1435114 Avatar asked Dec 30 '25 17:12

user1435114


2 Answers

You can do this using pairwise alignment with affine gap costs, which takes O(nm) time for two strings of lengths n and m respectively.

One thing first: There is no way to find a provably minimal patch in terms of bits or bytes used. That's because if there was such a way, then the function shortest_patch(x, y) that calculates it could be used to find a provably minimal compression of any given string s by calling it with shortest_patch('', s), and Kolmogorov complexity tells us that the shortest possible compression of a given string is formally uncomputable. But if edits tend to be clustered in space, as it seems they are here, then it's certainly possible to find smaller patches than those produced using the usual Levenshtein distance algorithm.

Edit scripts

Patches are usually called "edit scripts" in CS. Finding a minimal (in terms of number of insertions plus number of deletions) edit script for turning one string x into another string y is equivalent to finding an optimal pairwise alignment in which every pair of equal characters has value 0, every pair of unequal characters has value -inf, and every position in which a character from one string is aligned with a - gap character has value -1. Alignments are easy to visualise:

st--ing    st-i-ng
stro-ng    str-ong

These are 2 optimal alignments of the strings sting and strong, each having cost -3 under the model. If pairs of unequal characters are given the value -1 instead of -inf, then we get an alignment with cost equal to the Levenshtein distance (the number of insertions, plus the number of deletions, plus the number of substitutions):

st-ing    sti-ng
strong    strong

These are 2 optimal alignments under the new model, and each has cost -2.

To see how these correspond with edit scripts, we can regard the top string as the "original" string, and the bottom string as the "target" string. Columns containing pairs of unequal characters correspond to substitutions, the columns containing a - in the top row correspond to insertions of characters, and the columns containing a - in the bottom row correspond to deletions of characters. You can create an edit script from an alignment by using the "instructions" (C)opy, (D)elete, (I)nsert and (S)ubstitute. Each instruction is followed by a number indicating the number of columns to consume from the alignment, and in the case of I and S, a corresponding number of characters to insert or replace with. For example, the edit scripts for the previous 2 alignments are

C2, I1"r", S1"o", C2     and     C2, S1"r", I1"o", C2

Increasing bunching

Now if we have strings like mississippi and tip, we find that the two alignments

mississippi
------tip--

mississippi
t---i----p-

both have the same score of -9: they both require the same total number of insertions, deletions and substitutions. But we much prefer the top one, because its edit script can be described much more succinctly: D6, S1"t", C2, D2. The second's edit script would be S1"t", D3, C1, D4, C1, D1.

In order to get the alignment algorithm to also "prefer" the first alignment, we can adjust gap costs so that starting a blocks of gaps costs more than continuing an existing block of gaps. If we make it so that a column containing a gap costs -2 instead of -1 when the preceding column contains no gap, then what we are effectively doing is penalising the number of contiguous blocks of gaps (since each contiguous block of gaps must obviously have a first position). Under this model, the first alignment above now costs -11, because it contains two contiguous blocks of gaps. The second alignment now costs -12, because it contains three contiguous blocks of gaps. IOW, the algorithm now prefers the first alignment.

This model, in which every aligned position containing a gap costs g and the first position in any contiguous block of gap columns costs g + s, is called the affine gap cost model, and an O(nm) algorithm was given for this by Gotoh in 1982: http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/archive/JMB82.pdf. Increasing the gap-open cost s will cause aligned segments to bunch together. You can play with the various cost parameters until you get alignments (corresponding to patches) that empirically look about right and are small enough.

like image 85
j_random_hacker Avatar answered Jan 01 '26 21:01

j_random_hacker


There are two approaches to solving this kind of problem:

1) Establish a language for X (edit scripts, in this case), and figure out how to minimize the length of the applicable sentence; or,

2) Compute some kind of minimum representation for Y (string differences), and then think up a way to represent that in the shortest form.

The Myers paper demonstrates that for a particular language, finding the minimum set of changes and finding the minimum length of the change representation are the same problem.

Obviously, changing the language might invalidate that assumption, and certain changes might be extremely complicated to apply correctly (for example, suppose the language included the primitive kP which means to remove the next k characters whose indices are prime. For certain diffs, using that primitive might turn out to be a huge win, but the applications are probably pretty rare. It's an absurd example, I know, but it demonstrates the difficulty of starting with a language.

So I propose starting with the minimum change list, which identifies inserts and deletes. We translate that in a straightforward way to a string of commands, of which there are exactly three. There are no indices here. The idea is that we start with a cursor at the beginning of the original string, and then execute the commands in sequence. The commands are:

=  Advance the cursor without altering the character it points to
Ic Insert the character `c` before the cursor.
D  Delete the character at the cursor.

Although I said there were exactly three commands, that's not quite true; there are actually A+2 where A is the size of the alphabet.

This might result in a string like this:

=========================IbIaInIaInIaDD=D=D============================

Now, let's try to compress this. First, we run-length encode (RLE), so that every command is preceded by a repeat count, and we drop the trailing =s

27=1Ib1Ia1In1Ia1In1Ia2D1=1D1=1D

(In effect, the RLE recreates indices, although they're relative instead of absolute).

Finally, we use zlib to compress the resulting string. I'm not going to do that here, but just to give some idea of the sort of compression it might come up with:

27=1Ib1Ia1In||2D1=1D|
      ______+|  ____+
      ___<---+

(Trying to show the back-references. It's not very good ascii art, sorry.)

Liv-Zempell is very good at finding and optimizing unexpected repetitions. In fact, we could have just used it instead of doing the intermediate RLE step, but experience shows that in cases where RLE is highly effective, it's better to LZ the RLE than the source. But it would be worth trying both ways to see what's better for your application.

like image 27
rici Avatar answered Jan 01 '26 22:01

rici