Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for retrieving strings that are close by Levenshtein distance

For example, starting with the set of english words, is there a structure/algorithm that allows one fast retrieval of strings such as "light" and "tight", using the word "right" as the query? I.e., I want to retrieve strings with small Levenshtein distance to the query string.

like image 409
MaiaVictor Avatar asked Feb 13 '13 02:02

MaiaVictor


People also ask

How do you read Levenshtein distance?

The Levenshtein distance is a number that tells you how different two strings are. The higher the number, the more different the two strings are. For example, the Levenshtein distance between “kitten” and “sitting” is 3 since, at a minimum, 3 edits are required to change one into the other.

What are three major operations of levenshtein edit distance?

Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string and (iii) replace a character of a string by another character; for these operations, edit distance is sometimes known as Levenshtein distance .

What will be the Levenshtein distance between two strings room and Roomroom?

The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.

How do you normalize Levenshtein distance?

To quantify the similarity, we normalize the edit distance. One approach is to calculate edit distance as usual and then divide it by the number of operations, usually called the length of the edit path. This is called edit distance with post-normalization.


2 Answers

The BK-tree data structure might be appropriate here. It's designed to efficiently support queries of the form "what are all words within edit distance k or less from a query word?" Its performance guarantees are reasonably good, and it's not too difficult to implement.

Hope this helps!

like image 122
templatetypedef Avatar answered Sep 19 '22 06:09

templatetypedef


Since calculating Levenshtein distance is O(nm) for strings of length n and m, the naive approach of calculating all Levenshtein distances L(querystring, otherstring) is very expensive.

However, if you visualize the Levenshtein algorithm, it basically fills an n*m table with edit distances. But for words that start with the same few letters (prefix), the first few rows of the Levenshtein tables will be the same. (Fixing the query string, of course.)

This suggests using a trie (also called prefix tree): Read the query string, then build a trie of Levenshtein rows. Afterwards, you can easily traverse it to find strings close to the query string.

(This does mean that you have to build an new trie for a new query string. I don't think there is a similarly intriguing structure for all-pairs distances.)

I thought I recently saw an article about this with a nice python implementation. Will add a link if I can find it. Edit: Here it is, on Steve Hanov's blog.

like image 25
us2012 Avatar answered Sep 22 '22 06:09

us2012