Given two strings s and t, I need to find the edit distance (Levenshtein distance) from each substring of s to t. More precisely, for each position i in s, I need the minimum edit distance over all substrings that start at position i.
For example:
t = "ab"
s = "sdabcb"
And I need to get something like:
{2,1,0,1,1,1}
Explanation:
1st position:
distance("ab", "sd") = 2 (2 * subst)
distance("ab", "sda") = 3 (2 * delete + insert)
distance("ab", "sdab") = 2 (2 * delete)
distance("ab", "sdabc") = 3 (3 * delete)
distance("ab", "sdabcb") = 4 (4 * delete)
So, minimum is 2
2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2*delete)
....
So, minimum is 1
3rd position:
distance("ab", "ab") = 0
...
minimum is 0
and so on.
I can solve this with a brute-force algorithm, of course, but is there a faster algorithm?
Thanks for help.
Delete the m-th character of str1 and compute the edit distance between the first m-1 characters of str1 and the first n characters of str2. The cost of this case is simply 1 + array[m-1][n], where 1 is the cost of the delete operation and array[m-1][n] is the edit distance between the first m-1 characters of str1 and the first n characters of str2.
In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.
It is defined as the minimum number of changes required to convert string a into string b (by inserting, deleting or replacing a character in string a). The smaller the Levenshtein distance, the more similar the strings are.
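As a minimal sketch of that definition, here is the classic dynamic-programming table with unit cost for every insert, delete and substitute. The names edit_distance and dist are illustrative and do not come from any of the posts above.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, unit cost per edit."""
    m, n = len(a), len(b)
    # dist[i][j] = distance between the first i chars of a and the first j chars of b
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of a
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The delete branch, dist[i - 1][j] + 1, is the "1 + array[m-1][n]" case; the other two branches handle insertion and substitution in the same way.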
The number of edits in a sequence that turns one string into another (even for two identical strings) has no upper bound unless you restrict repeated edits, which is why edit distance is defined as the minimum.
Finding the best-matching substring in a given string is easy: take the normal Levenshtein algorithm and modify it slightly.
FIRST: Instead of filling the first row of the matrix (the row indexed by the haystack) with 0,1,2,3,4,5,... you fill it entirely with zeros.
SECOND: Then you run the algorithm as usual.
THIRD: Instead of returning the last cell of the last row, you search for the smallest value in the last row and return it.
Example: needle: "aba", haystack: "c abba c" --> result = 1 (converting abba -> aba)
I tested it and it works.
This is much faster than stepping through the string character by character, as you do in your question, because the matrix is built only once: the whole computation takes O(len(needle) * len(haystack)) time.
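The three steps can be sketched as follows, keeping only two rows of the matrix at a time. The function names are hypothetical. The last row is indexed by where a substring ends, so best_distance_by_start reverses both strings to get one value per start position, as the question asks. That reversal is an addition of mine, not part of the answer above.

```python
def best_distance_by_end(needle, haystack):
    """Last DP row: row[j] is the minimum edit distance between needle and
    any substring of haystack that ends just before index j."""
    # FIRST: the first row is all zeros, so a match may start anywhere for free.
    prev = [0] * (len(haystack) + 1)
    # SECOND: run the usual recurrence, one needle character per row.
    for i in range(1, len(needle) + 1):
        curr = [i] + [0] * len(haystack)  # first column: delete i needle chars
        for j in range(1, len(haystack) + 1):
            cost = 0 if needle[i - 1] == haystack[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev


def best_substring_distance(needle, haystack):
    # THIRD: the answer is the smallest value in the last row.
    return min(best_distance_by_end(needle, haystack))


def best_distance_by_start(needle, haystack):
    """One value per start position of haystack (assumption: reversing both
    strings turns 'ends at' into 'starts at')."""
    row = best_distance_by_end(needle[::-1], haystack[::-1])
    n = len(haystack)
    return [row[n - i] for i in range(n)]


print(best_substring_distance("aba", "c abba c"))  # 1
print(best_distance_by_start("ab", "sdabcb"))      # [2, 1, 0, 1, 1, 1]
```

The sketch also allows the empty substring, which costs len(needle). For a non-empty needle this never beats a one-character substring, so the per-position minimum is unaffected.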