I've been reviewing Eugene Myers' Diff Algorithm Paper. This is the algorithm that is implemented in the popular diff
program.
On page 12 of the paper, it presents the pseudo-code for the algorithm to find the longest common sub-sequence of A
and B
:
LCS(A, N, B, M)
If N > 0 and M > 0 Then
Find the middle snake and the length of an optimal path for A and B.
Suppose it is from (x, y) to (u, v).
If D > 1 Then
LCS(A[1..x], x, B[1..y], y)
Output A[x+1..u]
LCS(A[u+1..N], N-u, B[v+1..M], M-v)
Else If M > N Then
Output A[1..N].
Else
Output B[1..M].
Suppose A = "A" and B = "B". In this case, N = 1 and M = 1. The middle snake would be (x, y) = (0, 1) and (u, v) = (0, 1) because there are no diagonals. In this case D = 1 because the algorithm has only taken one step.
The algorithm says that the only thing to do in this scenario is to Output B[1..M]
, equal to "B", because N > 0, M > 0, D = 1, and M = N. But this seems wrong, because there is no common sub-sequence between "A" and "B". The paper's commentary that "If D <= 1 then B is obtained from A by either deleting or inserting at most one symbol" is incorrect because "A" must be removed and "B" added.
What am I misinterpreting here?
You are misunderstanding the definition of D-path and snake.
From page 4:
Let a D-path be a path starting at (0,0) that has exactly D non-diagonal edges. A 0-path must consist solely of diagonal edges. By a simple induction, it follows that a D-path must consist of a (D − 1)-path followed by a non-diagonal edge and then a possibly empty sequence of diagonal edges called a snake
So, in your example where A = "A" and B = "B", the optimal path is a 2-path (one horizontal and one vertical), and the middle snake is an empty string. We know from inspection the LCS is an empty string, but we'd like to show the algorithm working to prove it.
First of all we need to find the middle snake. If you follow the algorithm to find the middle snake on page 11, you will see that the length of the shortest edit script is 2 and (x,y) = (u,v) = (1,0) or (0,1). In other words, it is an empty snake in the middle of the path.
The algorithm pseudocode has some non-obvious notational conventions:
So, in this example case, assume the first call to LCS finds a middle snake with (x,y) = (u,v) = (1,0), then since D=2, the result is the expansion of:
LCS(A[1..1], 1, B[1..0], 0) // Output nothing since M = 0
Output A[2..1] // Output nothing since it is an empty string.
LCS(A[2..1], 0, B[1..1], 1) // Output nothing since N = 0
This is correct since these strings have no common sub-sequence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With