Is git's implementation of the patience diff algorithm correct?

Tags:

This question on Stackoverflow seemed to be a good candidate for the application of the patience diff algorithm. However, while testing my potential answer, I found that git diff --patience doesn't work up to my expectations (and, in this case, is no different from the default diff algorithm):

$ cat a
/**
 * Function foo description.
 */
function foo() {}

/**
 * Function bar description.
 */
function bar() {}

$ cat b
/**
 * Function bar description.
 */
function bar() {}

$ git diff --no-index --patience a b
diff --git a/a b/b
index 3064e15..a93bad0 100644
--- a/a
+++ b/b
@@ -1,9 +1,4 @@
 /**
- * Function foo description.
- */
-function foo() {}
-
-/**
  * Function bar description.
  */
 function bar() {}

I would expect the diff to be:

diff --git a/a b/b
index 3064e15..a93bad0 100644
--- a/a
+++ b/b
@@ -1,8 +1,3 @@
-/**
- * Function foo description.
- */
-function foo() {}
-
 /**
  * Function bar description.
  */

In my understanding, unique common lines in this case are the two lines mentioning bar, and the longest common context around those lines should be the function bar() along with its docblock, which means that the diff should boil down to the removed function foo() together with its own docblock and the following blank line.

431

asked Oct 19 '16 13:10

Leon

2 Answers

No one else has addressed this for a while, so I'll take a stab at it. This is all purely high-level theoretical though, since I have not read the paper on the original patience algorithm.

The LCS (longest common subsequence) algorithms are all about reducing the time spent finding a minimal edit distance solution. The standard (dynamic programming) solution is O(MN) where M is the number of symbols in the original string and N is the number of symbols in the target string. In our case, the "symbols" are lines, and the "string" is the collection of lines, rather than strings with characters (where the symbols would be, e.g., ASCII codes). We simply fill in an M x N matrix of "edit costs"; when we're done, we produce the actual edit by tracing a minimal path backwards through the resulting matrix. See https://jlordiales.me/2014/03/01/dynamic-programming-edit-distance/ for an example. (Web page found via Google search: it's not something I had anything to do with, other than to scan it at high speed for correctness now. It seems correct. :-) )

Actually computing this matrix is fairly expensive for big files, since M and N are the number of source lines (usually approximately equal): a ~4k line file results in ~16M entries in the matrix, which must be filled in completely before we can trace the minimal path back. Moreover, comparing "symbols" is no longer as trivial as comparing characters, since each "symbol" is a complete line. (The usual trick is to hash each line and compare hashes instead during the matrix generation, and then re-check during traceback, replacing "keep unchanged symbol" with "delete original and insert new" if the hash misled us. This works fine even in the presence of hash collisions: we may get a very slightly sub-optimal edit sequence, but it will practically never be awful.)

LCS modifies the matrix-calculation by observing that keeping long common subsequences ("preserve all these lines") almost always results in a big win. Having found some good LCS-es, we break the problem into "edit the non-common prefix, keep the common sequence, and edit the non-common suffix": now we calculate two dynamic programming matrices, but for smaller problems, so it goes faster. (And, of course, we can recurse on the prefix and suffix. If we had a ~4k-line file and we found ~2k totally-unchanged, in-common lines near the middle, leaving ~0.5k lines at the top and ~1.5k at the bottom, we can check for long common sub-sequences in the ~0.5k "top has a difference" lines, and then again in the ~1.5k "bottom has a difference" lines.)

LCS does poorly, and thus results in terrible diffs, when the "common subsequences" are trivial lines like }, that have lots of matches but are not really relevant. The patience diff variant simply discards these lines from the initial LCS calculation, so that they're not part of a "common subsequence". This makes the remaining matrices larger, which is why you must be patient. :-)

The result is that patience diff is no help here because our problem has nothing to do with common subsequences. In fact, even if we discarded LCSes entirely and just did one big matrix, we'd still get an undesirable result. Our problem is that the cost of deleting:

- * Function foo description.
- */
-function foo() {}
-
-/**

(and inserting nothing) is the same as the cost of deleting:

-/**
- * Function foo description.
- */
-function foo() {}
-

The cost of either one is just "delete 5 symbols". Even if we weight each symbol—make non-empty lines "more expensive" to delete than empty lines—the cost remains the same: we're deleting the same five lines, in the end.

What we need, instead, is some way to weight lines based on "visual clustering": short lines at an edge are cheaper to delete than short lines in the middle. The compaction heuristic added to Git 2.9 attempts to do this after the fact. It's obviously at least slightly flawed (only blank lines count, and they have to actually exist, not just be implied by reaching an edge). It might be better to do the weighting during matrix-filling (assuming what's left, after doing LCS elimination, really is going through the full dynamic programming matrix). This is nontrivial, though.

190

answered Nov 13 '22 22:11

torek

I found a newer post by Bram Cohen, with his description of the patience diff algorithm that supports the observed output of git diff:

... how Patience Diff works -

Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.

Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.

Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.

Do steps 1-2 on each section between matched lines

So the algorithm's emphasis on unique lines is undermined by the steps 1. and 2. which detect the common prefix and suffix even if they are made of noisy lines.

Such a formulation is a little different from what I've seen before that, and Bram admits that he has slightly changed it:

I've previously described it with the ordering a bit different ...

My question actually repeated the concern expressed in this comment to Bram's post.

answered Nov 13 '22 22:11

Leon

Related questions
                            
                                Git rebase: Combine non-subsequent commits
                            
                                Dummy questions about setting up git on amazon cloud ec2
                            
                                github and composer. pulling repository branches
                            
                                Git listing non-existent remotes
                            
                                Head branch unknown
                            
                                Resolving Git conflicts in SourceTree
                            
                                Git clone with password @
                            
                                Running tests on feature branches
                            
                                Add git submodule as reference to existing solution in VS
                            
                                Your branch and 'origin/master' have diverged, All conflicts fixed but you are still merging
                            
                                Gradle: How to perform git pull through gradle?
                            
                                Clone a repo with submodules: override credentials
                            
                                Git didn't recognize rename of a file
                            
                                How do you keep secrets.yml secret in rails?
                            
                                SVN-Git migration: clean-git branch error
                            
                                Adding an existing android studio project to gitlab repository
                            
                                How do you sync existing code existing repository using git
                            
                                What happens when you git push but someone else has already committed something you don't have on your local repo?
                            
                                Why PS1 is not updated with the git branch name?
                            
                                git - How to ignore only one file without extension?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is git's implementation of the patience diff algorithm correct?

Tags:

git

git-diff

Leon

People also ask

2 Answers

torek

Leon

Recent Activity

Donate For Us