Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Levenshtein distance with back tracing in PHP

I'm trying to align strings in PHP using Levenshtein distance algorithm. The problem is that my back tracing code does not work properly for all cases. For example when the second array has inserted lines at the beginning. Then the back tracing will only go as far as when i=0.

How to properly implement back tracing for Levenshtein distance?

Levenshtein distance, $s and $t are arrays of strings (rows)

function match_rows($s, $t)
{
$m = count($s);
$n = count($t);

for($i = 0; $i <= $m; $i++) $d[$i][0] = $i;
for($j = 0; $j <= $n; $j++) $d[0][$j] = $j;

for($i = 1; $i <= $m; $i++)
{
    for($j = 1; $j <= $n; $j++)
    {
        if($s[$i-1] == $t[$j-1])
        {
            $d[$i][$j] = $d[$i-1][$j-1];
        }
        else
        {
            $d[$i][$j] = min($d[$i-1][$j], $d[$i][$j-1], $d[$i-1][$j-1]) + 1;
        }
    }
}

// backtrace
$i = $m;
$j = $n;
while($i > 0 && $j > 0)
{
    $min = min($d[$i-1][$j], $d[$i][$j-1], $d[$i-1][$j-1]);

    switch($min)
    {
        // equal or substitution
        case($d[$i-1][$j-1]):
            if($d[$i][$j] != $d[$i-1][$j-1])
            {
                // substitution
                $sub['i'][] = $i;
                $sub['j'][] = $j;
            }
            $i = $i - 1;
            $j = $j - 1;
            break;

        // insertion
        case($d[$i][$j-1]):
            $ins[] = $j;
            $j = $j - 1;
            break;

        // deletion
        case($d[$i-1][$j]):
            $del[] = $i;
            $i = $i - 1;
            break;
    }
}
like image 812
mixxu Avatar asked Oct 18 '12 17:10

mixxu


People also ask

How to use Levenshtein function in PHP?

The levenshtein() function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2. By default, PHP gives each operation (replace, insert, and delete) equal weight.

What is the difference between Hamming distance and Levenshtein distance?

The Hamming distance is the number of positions at which the corresponding symbols in the two strings are different. The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

How is Levenshtein calculated?

The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1) —where M and N are the lengths of the 2 words—and looping through said matrix using 2 for loops, performing some calculations within each iteration.

Is Levenshtein an algorithm?

The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.


1 Answers

This is not to be nit-picky, but to help you find the answers you want (and improve your implementation).

The algorithm you are using is the Wagner-Fischer algorithm, not the Levenshtein algorithm. Also, Levenshtein distance is not use to align strings. It is strictly a distance measurement.

There are two types of alignment: global and local. Global alignment is used to minimize the distance between two entire strings. Example: global align "RACE" on "REACH", you get "RxACx". The x's are gaps.

In general, this is the Needleman-Wunsch algorithm, which is very similar to the Wagner-Fischer algorithm. Local alignment finds a substring in a long string and minimizes the difference between a short string and a the substring of the long string.

Example: local align "BELL" on "UMBRELLA" and you get "BxELL" aligned on "BRELL". It is the Smith-Waterman algorithm which, again, is very similar to the Wagner-Fischer algorithm.

I hope that this is helpful in allowing you to better define the exact kind of alignment you want.

like image 56
kainaw Avatar answered Oct 16 '22 18:10

kainaw