Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Eugene Myers' Diff Algorithm: Finding the Longest Common Subsequence of "A" and "B"

Tags:

algorithm

diff

I've been reviewing Eugene Myers' Diff Algorithm Paper. This is the algorithm that is implemented in the popular diff program.

On page 12 of the paper, it presents the pseudo-code for the algorithm to find the longest common sub-sequence of A and B:

LCS(A, N, B, M)
    If N > 0 and M > 0 Then
        Find the middle snake and the length of an optimal path for A and B.
        Suppose it is from (x, y) to (u, v).
        If D > 1 Then
            LCS(A[1..x], x, B[1..y], y)
            Output A[x+1..u]
            LCS(A[u+1..N], N-u, B[v+1..M], M-v)
        Else If M > N Then
            Output A[1..N].
        Else
            Output B[1..M].

Suppose A = "A" and B = "B". In this case, N = 1 and M = 1. The middle snake would be (x, y) = (0, 1) and (u, v) = (0, 1) because there are no diagonals. In this case D = 1 because the algorithm has only taken one step.

The algorithm says that the only thing to do in this scenario is to Output B[1..M], equal to "B", because N > 0, M > 0, D = 1, and M = N. But this seems wrong, because there is no common sub-sequence between "A" and "B". The paper's commentary that "If D <= 1 then B is obtained from A by either deleting or inserting at most one symbol" is incorrect because "A" must be removed and "B" added.

What am I misinterpreting here?

like image 950
Ana Avatar asked Apr 14 '15 20:04

Ana


1 Answers

You are misunderstanding the definition of D-path and snake.

From page 4:

Let a D-path be a path starting at (0,0) that has exactly D non-diagonal edges. A 0-path must consist solely of diagonal edges. By a simple induction, it follows that a D-path must consist of a (D − 1)-path followed by a non-diagonal edge and then a possibly empty sequence of diagonal edges called a snake

So, in your example where A = "A" and B = "B", the optimal path is a 2-path (one horizontal and one vertical), and the middle snake is an empty string. We know from inspection the LCS is an empty string, but we'd like to show the algorithm working to prove it.

First of all we need to find the middle snake. If you follow the algorithm to find the middle snake on page 11, you will see that the length of the shortest edit script is 2 and (x,y) = (u,v) = (1,0) or (0,1). In other words, it is an empty snake in the middle of the path.

The algorithm pseudocode has some non-obvious notational conventions:

  1. A[m..n] is an empty string if n < m.
  2. In the recursive calls to LCS, the first call uses ceiling(D/2) as D and the second call uses floor(D/2). This is made more clear in the text above the algorithm on page 12.

So, in this example case, assume the first call to LCS finds a middle snake with (x,y) = (u,v) = (1,0), then since D=2, the result is the expansion of:

LCS(A[1..1], 1, B[1..0], 0)  // Output nothing since M = 0
Output A[2..1]               // Output nothing since it is an empty string.
LCS(A[2..1], 0, B[1..1], 1)  // Output nothing since N = 0

This is correct since these strings have no common sub-sequence.

like image 186
caveman Avatar answered Oct 08 '22 10:10

caveman