Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Levenshtein Distance: Inferring the edit operations from the matrix

I wrote Levenshtein algorithm in in C++

If I input:
string s: democrat
string t: republican

I get the matrix D filled-up and the number of operations (the Levenshtein distance) can be read in D[10][8] = 8
Beyond the filled matrix I want to construct the optimal solution. How must look this solution? I don't have an idea.
Please only write me HOW MUST LOOK for this example.

like image 665
borebardha Avatar asked May 01 '11 15:05

borebardha


4 Answers

The question is
Given the matrix produced by the Levenshtein algorithm, how can one find "the optimal solution"?
i.e. how can we find the precise sequence of string operations: inserts, deletes and substitution [of a single letter], necessary to convert the 's string' into the 't string'?

First, it should be noted that in many cases there are SEVERAL optimal solutions.
While the Levenshtein algorithm supplies the minimum number of operations (8 in democrat/republican example) there are many sequences (of 8 operations) which can produce this conversion.

By "decoding" the Levenshtein matrix, one can enumerate ALL such optimal sequences.
The general idea is that the optimal solutions all follow a "path", from top left corner to bottom right corner (or in the other direction), whereby the matrix cell values on this path either remain the same or increase by one (or decrease by one in the reverse direction), starting at 0 and ending at the optimal number of operations for the strings in question (0 thru 8 democrat/republican case). The number increases when an operation is necessary, it stays the same when the letter at corresponding positions in the strings are the same.

It is easy to produce an algorithm which produces such a path (slightly more complicated to produce all possible paths), and from such path deduce the sequence of operations.

This path finding algorithm should start at the lower right corner and work its way backward. The reason for this approach is that we know for a fact that to be an optimal solution it must end in this corner, and to end in this corner, it must have come from one of the 3 cells either immediately to its left, immediately above it or immediately diagonally. By selecting a cell among these three cells, one which satisfies our "same value or decreasing by one" requirement, we effectively pick a cell on one of the optimal paths. By repeating the operation till we get on upper left corner (or indeed until we reach a cell with a 0 value), we effectively backtrack our way on an optimal path.

Illustration with the democrat - republican example

It should also be noted that one can build the matrix in one of two ways: with 'democrat' horizontally or vertically. This doesn't change the computation of the Levenshtein distance nor does it change the list of operations needed; it only changes the way we interpret the matrix, for example moving horizontally on the "path" either means inserting a character [from the t string] or deleting a character [off the s string] depending whether 'string s' is "horizontal" or "vertical" in the matrix.

I'll use the following matrix. The conventions are therefore (only going in the left-to-right and/or top-to-bottom directions)

  • an horizontal move is an INSERTION of a letter from the 't string'
  • an vertical move is a DELETION of a letter from the 's string'
  • a diagonal move is either:
    • a no-operation (both letters at respective positions are the same); the number doesn't change
    • a SUBSTITUTION (letters at respective positions are distinct); the number increase by one.

Levenshtein matrix for s = "democrat", t="republican"

      r  e  p  u  b  l  i  c  a  n
   0  1  2  3  4  5  6  7  8  9  10
d  1  1  2  3  4  5  6  7  8  9  10
e  2  2  1  2  3  4  5  6  7  8  9
m  3  3  2  2  3  4  5  6  7  8  9
o  4  4  3  3  3  4  5  6  7  8  9
c  5  5  4  4  4  4  5  6  6  7  8
r  6  5  5  5  5  5  5  6  7  7  8
a  7  6  6  6  6  6  6  6  7  7  8
t  8  7  7  7  7  7  7  7  7  8  8

The arbitrary approach I use to select one path among several possible optimal paths is loosely described below:

Starting at the bottom-rightmost cell, and working our way backward toward 
the top left.

For each "backward" step, consider the 3 cells directly adjacent to the current
cell   (in the left, top or left+top directions)

   if the value in the diagonal cell (going up+left) is smaller or equal to the
      values found in the other two cells
   AND 
      if this is same or 1 minus the value of the current cell 
   then  "take the diagonal cell"
         if the value of the diagonal cell is one less than the current cell:
            Add a SUBSTITUTION operation (from the letters corresponding to
            the _current_ cell)
         otherwise: do not add an operation this was a no-operation.

   elseif the value in the cell to the left is smaller of equal to the value of
       the of the cell above current cell
   AND 
       if this value is same or 1 minus the value of the current cell 
   then "take the cell to left", and
        add an INSERTION of the letter corresponding to the cell
   else
       take the cell above, add
       Add a DELETION operation of the letter in 's string'

Following this informal pseudo-code, we get the following:

Start on the "n", "t" cell at bottom right.
Pick the [diagonal] "a", "a" cell as next destination since it is less than the other two (and satisfies the same or -1 condition).
Note that the new cell is one less than current cell
therefore the step 8 is substitute "t" with "n":     democra N

Continue with "a", "a" cell,
Pick the [diagonal] "c", "r" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.

Continue with "c", "r" cell,
Pick the [diagonal] "i", "c" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 7 is substitute "r" with "c":     democ C an

Continue with "i", "c" cell,
Pick the [diagonal] "l", "o" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 6 is substitute "c" with "i":     demo I can

Continue with "l", "o" cell,
Pick the [diagonal] "b", "m" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 5 is substitute "o" with "l":     dem L ican

Continue with "b", "m" cell,
Pick the [diagonal]"u", "e" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 4 is substitute "m" with "b":     de B lican

Continue with "u", "e" cell,
Note the "diagonal" cell doesn't qualify, because the "left" cell is less than it. Pick the [left] "p", "e" cell as next destination...
therefore the step 3 is instert "u" after "e":     de U blican

Continue with "p", "e" cell,
again the "diagonal" cell doesn't qualify Pick the [left] "e", "e" cell as next destination...
therefore the step 2 is instert "p" after "e":     de P ublican

Continue with "e", "e" cell,
Pick the [diagonal] "r", "d" cell as next destination...
Note that the new cell is same value as current cell ==> no operation needed.

Continue with "r", "d" cell,
Pick the [diagonal] "start" cell as next destination...
Note that the new cell is one less than current cell
therefore the step 1 is substitute "d" with "r":     R epublican

You've arrived at a cell which value is 0 : your work is done!

like image 154
mjv Avatar answered Nov 15 '22 21:11

mjv


The backtracking algorithm to infer the moves from the matrix implemented in python:

    def _backtrack_string(matrix, output_word):
    '''
    Iteratively backtrack DP matrix to get optimal set of moves

    Inputs: DP matrix (list:list:int),
            Input word (str),
            Output word (str),
            Start x position in DP matrix (int),
            Start y position in DP matrix (int)
    Output: Optimal path (list)
    '''

    i = len(matrix) - 1
    j = len(matrix[0]) - 1
    optimal_path = []
    while i > 0 and j > 0:
        diagonal = matrix[i-1][j-1]
        vertical = matrix[i-1][j]
        horizontal = matrix[i][j-1]
        current = matrix[i][j]
        if diagonal <= vertical and diagonal <= horizontal and (diagonal <= current):
            i = i - 1
            j = j - 1
            if diagonal == current - 1:
                optimal_path.append("Replace " + str(j) + ", " + str(output_word[j]) )
            elif horizontal <= vertical and horizontal <= current:
                j = j - 1
                optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
            elif vertical <= horizontal and vertical <= current:
                i = i - 1
                optimal_path.append("Delete " + str(i))
        elif horizontal <= vertical and horizontal <= current:
            j = j - 1
            optimal_path.append("Insert " + str(j) + ", " + str(output_word[j]))
        else:
            i = i - 1
            optimal_path.append("Delete " + str(i))

    return reversed(optimal_path)

The output I get when I run the algorithm with original word "OPERATING" and desired word "CONSTANTINE" is the following

    Insert 0, C
    Replace 2, N
    Replace 3, S
    Replace 4, T
    Insert 6, N
    Replace 10, E

       ""  C  O  N  S  T  A  N  T  I  N   E

    "" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] 
              <--                              Insert 0, C
    O  [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9,  10]
                \                              Replace 2, N
    P  [2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9,  10]
                   \                           Replace 3, S
    E  [3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9,  9]
                      \                        Replace 4, T
    R  [4, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9,  10]  No move
                         \ <--                 Insert 6, N
    A  [5, 5, 5, 5, 5, 5, 4, 5, 6, 7, 8,  9]
                               \               No move
    T  [6, 6, 6, 6, 6, 5, 5, 5, 5, 6, 7,  8]
                                  \            No move
    I  [7, 7, 7, 7, 7, 6, 6, 6, 6, 5, 6,  7]
                                     \         No move
    N  [8, 8, 8, 7, 8, 7, 7, 6, 7, 6, 5,  6]
                                        \      Replace 10, E
    G  [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 6,  6]

Note that I had to add extra conditions if the element in the diagonal is the same as the current element. There could be a deletion or insertion depending on values in the vertical (up) and horizontal (left) positions. We only get a "no operation" or "replace" operation when the following occurs

# assume bottom right of a 2x2 matrix is the reference position 
# and has value v
# the following is the situation where we get a replace operation
    [v + 1 , v<]
    [  v<  , v]
# the following is the situation where we get a "no operation"
    [v , v<]
    [v<, v ] 

I think this is where the algorithm described in the first answer could break. There could be other arrangements in the 2x2 matrix above when neither operations are correct. The example shown with input "OPERATING" and output "CONSTANTINE" breaks the algorithm unless this is taken into account.

like image 34
Nissal Avatar answered Nov 15 '22 22:11

Nissal


It's been some times since I played with it, but it seems to me the matrix should look something like:

. . r e p u b l i c a n
. 0 1 2 3 4 5 6 7 8 9 10
d 1 1 2 3 4 5 6 7 8 9 10
e 2 2 1 2 3 4 5 6 7 8 9
m 3 3 2 2 3 4 5 6 7 8 9
o 4 4 3 3 3 4 5 6 7 8 9
c 5 5 4 4 4 4 5 6 7 8 9
r 6 5 5 5 5 5 5 6 7 8 9
a 7 6 6 6 6 6 6 6 7 7 8
t 8 7 7 7 7 7 7 7 7 7 8

Don't take it for granted though.

like image 26
Matthieu M. Avatar answered Nov 15 '22 20:11

Matthieu M.


Here is a VBA algorithm based on mjv's answer. (very well explained, but some case were missing).

    Sub TU_Levenshtein()

        Call Levenshtein("democrat", "republican")

        Call Levenshtein("ooo", "u") 

        Call Levenshtein("ceci est un test", "ceci n'est pas un test")

    End Sub

    Sub Levenshtein(ByVal string1 As String, ByVal string2 As String)

        ' Fill Matrix Levenshtein (-> array 'Distance')

        Dim i As Long, j As Long
        Dim string1_length As Long
        Dim string2_length As Long
        Dim distance() As Long

        string1_length = Len(string1)
        string2_length = Len(string2)
        ReDim distance(string1_length, string2_length)

        For i = 0 To string1_length
            distance(i, 0) = i
        Next

        For j = 0 To string2_length
            distance(0, j) = j
        Next

        For i = 1 To string1_length
            For j = 1 To string2_length
                If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
                    distance(i, j) = distance(i - 1, j - 1)
                Else
                    distance(i, j) = Application.WorksheetFunction.min _
                    (distance(i - 1, j) + 1, _
                     distance(i, j - 1) + 1, _
                     distance(i - 1, j - 1) + 1)
                End If
            Next
        Next

        LevenshteinDistance = distance(string1_length, string2_length) ' for information only

        ' Write Matrix on VBA sheets (only for visuation, not used in calculus)

        Cells.Clear

        For i = 1 To UBound(distance, 1)
            Cells(i + 2, 1).Value = Mid(string1, i, 1)
        Next i

        For i = 1 To UBound(distance, 2)
            Cells(1, i + 2).Value = Mid(string2, i, 1)
        Next i

        For i = 0 To UBound(distance, 1)

            For j = 0 To UBound(distance, 2)

                Cells(i + 2, j + 2) = distance(i, j)

            Next j

        Next i

        ' One solution

        current_posx = UBound(distance, 1)
        current_posy = UBound(distance, 2)

        Do

            cc = distance(current_posx, current_posy)

            Cells(current_posx + 1, current_posy + 1).Interior.Color = vbYellow ' visualisation again

            ' Manage border case

            If current_posy - 1 < 0 Then

                MsgBox ("deletion. " & Mid(string1, current_posx, 1))

                current_posx = current_posx - 1
                current_posy = current_posy

                GoTo suivant

            End If

            If current_posx - 1 < 0 Then

                MsgBox ("insertion. " & Mid(string2, current_posy, 1))

                current_posx = current_posx
                current_posy = current_posy - 1

                GoTo suivant

            End If

            ' Middle cases

            cc_L = distance(current_posx, current_posy - 1)
            cc_U = distance(current_posx - 1, current_posy)
            cc_D = distance(current_posx - 1, current_posy - 1)

            If (cc_D <= cc_L And cc_D <= cc_U) And (cc_D = cc - 1 Or cc_D = cc) Then

                If (cc_D = cc - 1) Then

                    MsgBox "substitution. " & Mid(string1, current_posx, 1) & " by " & Mid(string2, current_posy, 1)

                    current_posx = current_posx - 1
                    current_posy = current_posy - 1

                    GoTo suivant

                Else

                    MsgBox "no operation"

                    current_posx = current_posx - 1
                    current_posy = current_posy - 1

                    GoTo suivant

                End If

            ElseIf cc_L <= cc_D And cc_L = cc - 1 Then

                MsgBox ("insertion. " & Mid(string2, current_posy, 1))

                current_posx = current_posx
                current_posy = current_posy - 1

                GoTo suivant

            Else

                MsgBox ("deletion." & Mid(string1, current_posy, 1))

                current_posx = current_posx
                current_posy = current_posy - 1

                GoTo suivant

            End If

    suivant:

        Loop While Not (current_posx = 0 And current_posy = 0)

    End Sub
like image 24
JackIsJack Avatar answered Nov 15 '22 20:11

JackIsJack