I'm trying to align strings in PHP using the Levenshtein distance algorithm. The problem is that my backtracing code does not work properly in all cases, for example when the second array has lines inserted at the beginning: the backtrace then only goes as far as i = 0 and stops.
How do I properly implement backtracing for Levenshtein distance?
// Levenshtein distance; $s and $t are arrays of strings (rows)
function match_rows($s, $t)
{
    $m = count($s);
    $n = count($t);

    for($i = 0; $i <= $m; $i++) $d[$i][0] = $i;
    for($j = 0; $j <= $n; $j++) $d[0][$j] = $j;

    for($i = 1; $i <= $m; $i++)
    {
        for($j = 1; $j <= $n; $j++)
        {
            if($s[$i-1] == $t[$j-1])
            {
                $d[$i][$j] = $d[$i-1][$j-1];
            }
            else
            {
                $d[$i][$j] = min($d[$i-1][$j], $d[$i][$j-1], $d[$i-1][$j-1]) + 1;
            }
        }
    }

    // backtrace
    $i = $m;
    $j = $n;
    while($i > 0 && $j > 0)
    {
        $min = min($d[$i-1][$j], $d[$i][$j-1], $d[$i-1][$j-1]);
        switch($min)
        {
            // equal or substitution
            case($d[$i-1][$j-1]):
                if($d[$i][$j] != $d[$i-1][$j-1])
                {
                    // substitution
                    $sub['i'][] = $i;
                    $sub['j'][] = $j;
                }
                $i = $i - 1;
                $j = $j - 1;
                break;
            // insertion
            case($d[$i][$j-1]):
                $ins[] = $j;
                $j = $j - 1;
                break;
            // deletion
            case($d[$i-1][$j]):
                $del[] = $i;
                $i = $i - 1;
                break;
        }
    }
}
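One possible fix, as a minimal sketch rather than a definitive implementation: the loop above exits as soon as either index reaches 0, so any leading insertions (or deletions) are never recorded. Running the backtrace until both indices are 0, and treating the first row and first column as pure insertions and deletions, covers that case. The function name align_rows and the shape of the returned operation list are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): returns the list of
// edit operations that turn $s into $t, each as [operation, i, j] with
// 1-based positions into $s and $t.
function align_rows(array $s, array $t): array
{
    $m = count($s);
    $n = count($t);

    // Fill the (m+1) x (n+1) distance matrix.
    $d = [];
    for ($i = 0; $i <= $m; $i++) $d[$i][0] = $i;
    for ($j = 0; $j <= $n; $j++) $d[0][$j] = $j;
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($s[$i-1] === $t[$j-1]) ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i-1][$j] + 1,          // deletion
                $d[$i][$j-1] + 1,          // insertion
                $d[$i-1][$j-1] + $cost     // equal or substitution
            );
        }
    }

    // Backtrace until BOTH indices reach 0, not just one of them.
    $ops = [];
    $i = $m;
    $j = $n;
    while ($i > 0 || $j > 0) {
        if ($i > 0 && $j > 0) {
            $cost = ($s[$i-1] === $t[$j-1]) ? 0 : 1;
            if ($d[$i][$j] === $d[$i-1][$j-1] + $cost) {
                $ops[] = [$cost === 0 ? 'equal' : 'substitute', $i, $j];
                $i--;
                $j--;
                continue;
            }
        }
        if ($j > 0 && $d[$i][$j] === $d[$i][$j-1] + 1) {
            $ops[] = ['insert', $i, $j];   // row $j of $t has no partner in $s
            $j--;
        } else {
            $ops[] = ['delete', $i, $j];   // row $i of $s has no partner in $t
            $i--;
        }
    }

    return array_reverse($ops);
}

// Two rows inserted at the beginning of the second array.
$ops = align_rows(['a', 'b', 'c'], ['x', 'y', 'a', 'b', 'c']);
// $ops starts with ['insert', 0, 1], ['insert', 0, 2], then three 'equal' steps.
```

Each step here is accepted only if it is consistent with how the cell value was computed, which avoids relying on the order of the switch cases when several neighbours share the minimum.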
The levenshtein() function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2. By default, PHP gives each operation (replace, insert, and delete) equal weight.
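For illustration, a short sketch of the built-in function. The sample words are arbitrary; the optional cost arguments are passed in the order insertion, replacement, deletion. Note that levenshtein() works on two strings, not on arrays of rows, which is why the question uses its own matrix.

```php
<?php
// Default: every operation costs 1.
$plain = levenshtein('kitten', 'sitting');              // 3

// Custom costs: insertion = 1, replacement = 2, deletion = 1.
$weighted = levenshtein('kitten', 'sitting', 1, 2, 1);  // 5

echo $plain, ' ', $weighted, PHP_EOL;
```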
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. The Levenshtein distance satisfies the triangle inequality: the distance between two strings is no greater than the sum of their Levenshtein distances from a third string.
The Levenshtein distance is usually calculated by preparing a matrix of size (M+1)x(N+1), where M and N are the lengths of the two words, and filling it with two nested for loops. Each cell takes the minimum of its left, upper, and upper-left neighbours plus the cost of the corresponding edit.
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
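A small sketch of both statements. hamming_distance is a hypothetical helper written for this example (PHP has no built-in with that name), and the sample words are arbitrary.

```php
<?php
// Hypothetical helper: counts differing positions in two equal-length strings.
function hamming_distance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('Strings must have equal length');
    }
    $diff = 0;
    for ($k = 0; $k < strlen($a); $k++) {
        if ($a[$k] !== $b[$k]) {
            $diff++;
        }
    }
    return $diff;
}

$h = hamming_distance('karolin', 'kathrin');   // 3

// Triangle inequality check for one triple of words.
$direct = levenshtein('kitten', 'sitting');                                     // 3
$detour = levenshtein('kitten', 'sitten') + levenshtein('sitten', 'sitting');   // 1 + 2
$holds  = $direct <= $detour;                                                   // true
```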
This is not to be nit-picky, but to help you find the answers you want (and improve your implementation).
The algorithm you are using is the Wagner-Fischer algorithm, not the Levenshtein algorithm. Also, Levenshtein distance is not used to align strings. It is strictly a distance measurement.
There are two types of alignment: global and local. Global alignment minimizes the distance between two entire strings. Example: globally align "RACE" on "REACH" with unit edit costs and you get "RxACE", where the x is a gap and the final E is paired with the H as a substitution. In general, this is the Needleman-Wunsch algorithm, which is very similar to the Wagner-Fischer algorithm.
Local alignment finds a substring in a long string and minimizes the difference between a short string and that substring of the long string. Example: locally align "BELL" on "UMBRELLA" and you get "BxELL" aligned on "BRELL". This is the Smith-Waterman algorithm which, again, is very similar to the Wagner-Fischer algorithm.
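To make the global case concrete, here is a rough sketch that reuses the Wagner-Fischer matrix with unit costs and walks it back to build two gapped strings. It is not a full Needleman-Wunsch implementation with a configurable scoring scheme; the function name global_align and the '-' gap character are assumptions chosen for this example.

```php
<?php
// Hypothetical helper: globally aligns $a on $b with unit edit costs and
// returns [gapped $a, gapped $b].
function global_align(string $a, string $b, string $gap = '-'): array
{
    $m = strlen($a);
    $n = strlen($b);

    // Distance matrix, filled exactly as for Levenshtein distance.
    $d = [];
    for ($i = 0; $i <= $m; $i++) $d[$i][0] = $i;
    for ($j = 0; $j <= $n; $j++) $d[0][$j] = $j;
    for ($i = 1; $i <= $m; $i++) {
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i-1] === $b[$j-1]) ? 0 : 1;
            $d[$i][$j] = min($d[$i-1][$j] + 1, $d[$i][$j-1] + 1, $d[$i-1][$j-1] + $cost);
        }
    }

    // Walk back from the bottom-right corner until both indices are 0.
    $top = '';
    $bottom = '';
    $i = $m;
    $j = $n;
    while ($i > 0 || $j > 0) {
        if ($i > 0 && $j > 0) {
            $cost = ($a[$i-1] === $b[$j-1]) ? 0 : 1;
            if ($d[$i][$j] === $d[$i-1][$j-1] + $cost) {
                // Match or substitution: consume one character from each string.
                $top = $a[$i-1] . $top;
                $bottom = $b[$j-1] . $bottom;
                $i--;
                $j--;
                continue;
            }
        }
        if ($j > 0 && $d[$i][$j] === $d[$i][$j-1] + 1) {
            // Character of $b with no partner: gap in $a.
            $top = $gap . $top;
            $bottom = $b[$j-1] . $bottom;
            $j--;
        } else {
            // Character of $a with no partner: gap in $b.
            $top = $a[$i-1] . $top;
            $bottom = $gap . $bottom;
            $i--;
        }
    }

    return [$top, $bottom];
}

[$top, $bottom] = global_align('RACE', 'REACH');
echo $top, PHP_EOL, $bottom, PHP_EOL;
// R-ACE
// REACH
```

With unit costs the trailing E of "RACE" ends up paired with the H of "REACH" as a substitution rather than with a gap. A local (Smith-Waterman style) alignment would need a different scoring scheme and a traceback that starts from the best-scoring cell, which this sketch does not attempt.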
I hope that this is helpful in allowing you to better define the exact kind of alignment you want.