Algorithm to find edit distance to all substrings

Given two strings s and t, I need to find the edit distance (Levenshtein distance) from each substring of s to t. More precisely, for each position i in s, I need the minimum edit distance to t over all substrings that start at position i.

For example:

t = "ab"    
s = "sdabcb"

And I need to get something like:

{2,1,0,2,2}

Explanation:

1st position:
distance("ab", "sd") = 4 (2 * subst)
distance("ab", "sda") = 3 (2 * delete + insert)
distance("ab", "sdab") = 2 (2 * delete)
distance("ab", "sdabc") = 3 (3 * delete)
distance("ab", "sdabcb") = 4 (4 * delete)
So the minimum is 2

2nd position:
distance("ab", "da") = 2 (delete + insert)
distance("ab", "dab") = 1 (delete)
distance("ab", "dabc") = 2 (2 * delete)
...
So the minimum is 1

3rd position:
distance("ab", "ab") = 0
...
So the minimum is 0

and so on.

I can solve this by brute force, of course, but is there a faster algorithm?

Thanks for help.

Ivan Bianko asked Nov 15 '11 16:11



1 Answer

Finding the closest matching substring in a given string is easy. You take the normal Levenshtein algorithm and modify it slightly.

FIRST: Instead of filling the first row of the matrix with 0,1,2,3,4,5,... you fill it entirely with zeros, so that a match can start at any position at no cost. (green rectangle)

SECOND: Then you run the algorithm as usual.

THIRD: Instead of returning the last cell of the last row, you search for the smallest value in the last row and return it, so that a match can end at any position. (red rectangle)

Example: needle: "aba", haystack: "c abba c" --> result = 1 (converting abba -> aba)
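
The three steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the answerer's original code: the function name `min_substring_distance` is made up for this example, and it assumes unit costs for insert, delete, and substitute. Only two rows of the matrix are kept in memory at a time.

```python
def min_substring_distance(needle: str, haystack: str) -> int:
    """Smallest Levenshtein distance between needle and any substring of haystack.

    Hypothetical helper for illustration; assumes unit edit costs.
    """
    # FIRST: the row for the empty needle prefix is all zeros,
    # because a match may start at any haystack position for free.
    prev = [0] * (len(haystack) + 1)

    # SECOND: run the usual Levenshtein recurrence, row by row.
    for i, nc in enumerate(needle, start=1):
        # First column: i needle characters against an empty substring.
        cur = [i]
        for j, hc in enumerate(haystack, start=1):
            cur.append(min(
                prev[j] + 1,               # needle character has no counterpart
                cur[j - 1] + 1,            # haystack character has no counterpart
                prev[j - 1] + (nc != hc),  # match (cost 0) or substitution (cost 1)
            ))
        prev = cur

    # THIRD: the match may end anywhere, so take the minimum of the last row.
    return min(prev)


print(min_substring_distance("aba", "c abba c"))  # 1
```

The printed value agrees with the example above: the best substring is one edit away from the needle.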

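[Figure: the Levenshtein matrix for the example, with the first row marked by a green rectangle and the last row by a red rectangle]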

I tested it and it works.

This is much faster than the approach in your question of computing the distance separately for every substring. The matrix is built only once, so the whole search takes O(|s| * |t|) time.
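
One detail worth noting: each cell of the last row is the best distance over substrings that end at that column, whereas the question asks for substrings that start at each position. A possible way to bridge this, sketched below, is to run the same zero-first-row matrix on both strings reversed. The names `last_row` and `min_distance_by_start` are hypothetical, and the sketch assumes unit costs and a non-empty t.

```python
def last_row(needle: str, haystack: str) -> list[int]:
    """Last row of the zero-first-row Levenshtein matrix.

    Entry j is the smallest distance between needle and any substring of
    haystack that ends right before index j. Assumes unit edit costs.
    """
    prev = [0] * (len(haystack) + 1)
    for i, nc in enumerate(needle, start=1):
        cur = [i]
        for j, hc in enumerate(haystack, start=1):
            cur.append(min(
                prev[j] + 1,               # skip a needle character
                cur[j - 1] + 1,            # skip a haystack character
                prev[j - 1] + (nc != hc),  # match or substitute
            ))
        prev = cur
    return prev


def min_distance_by_start(t: str, s: str) -> list[int]:
    """For each index i of s, the minimum distance from t to a substring starting at i.

    Assumes t is non-empty. A substring of s starting at i corresponds to a
    substring of reversed s ending at column len(s) - i.
    """
    row = last_row(t[::-1], s[::-1])
    n = len(s)
    return [row[n - i] for i in range(n)]


print(min_distance_by_start("ab", "sdabcb"))  # [2, 1, 0, 1, 1, 1]
```

With unit-cost substitutions the result for the question's strings has one entry per character of s. It differs from the list in the question after the third position, apparently because the worked example there charges 2 for a substitution and lists only five positions.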

Elmue answered Oct 24 '22 22:10