Data structure for retrieving strings that are close by Levenshtein distance

Tags: string, algorithm, data-structures, levenshtein-distance

For example, starting with the set of English words, is there a structure/algorithm that allows fast retrieval of strings such as "light" and "tight", using the word "right" as the query? I.e., I want to retrieve strings with a small Levenshtein distance to the query string.

409

asked Feb 13 '13 02:02

MaiaVictor

2 Answers

The BK-tree data structure might be appropriate here. It's designed to efficiently support queries of the form "what are all words within edit distance k or less from a query word?" Its performance guarantees are reasonably good, and it's not too difficult to implement.
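To make the idea concrete, here is a minimal sketch of a BK-tree in Python. The names `BKTree`, `add`, and `query` are my own choices for illustration, not taken from any particular library, and the node layout (a tuple of word and child dictionary) is just one convenient assumption. Each child hangs off its parent under the edit distance between the two words, and the triangle inequality lets a query skip every subtree whose edge label falls outside [d - k, d + k].

```python
def levenshtein(a, b):
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


class BKTree:
    """Illustrative BK-tree; a node is (word, {distance: child_node})."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, k):
        """Return sorted (distance, word) pairs within distance k."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= k:
                results.append((d, candidate))
            # Triangle inequality: matches can only sit under edges
            # labelled between d - k and d + k.
            for edge, child in children.items():
                if d - k <= edge <= d + k:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for w in ["right", "light", "tight", "night", "bright", "apple"]:
    tree.add(w)
print(tree.query("right", 1))
```

How much of the tree gets pruned depends on the dictionary and on k, so it is worth measuring on your own word list.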

Hope this helps!

122

answered Sep 19 '22 06:09

templatetypedef

Since calculating Levenshtein distance is O(nm) for strings of length n and m, the naive approach of calculating all Levenshtein distances L(querystring, otherstring) is very expensive.
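For reference, the naive approach looks roughly like this in Python. It is a sketch, with `levenshtein` and `naive_search` being names I picked for illustration; it runs the O(nm) dynamic program once per dictionary word, which is exactly the cost the rest of this answer tries to avoid.

```python
def levenshtein(a, b):
    """Edit distance, filling the table one row at a time."""
    prev = list(range(len(b) + 1))   # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def naive_search(words, query, max_dist):
    """Compare the query against every word: one full table per word."""
    return [w for w in words if levenshtein(query, w) <= max_dist]


print(naive_search(["light", "tight", "apple"], "right", 1))
# prints ['light', 'tight']
```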

However, if you visualize the Levenshtein algorithm, it basically fills an n*m table with edit distances. But for words that start with the same few letters (prefix), the first few rows of the Levenshtein tables will be the same. (Fixing the query string, of course.)

This suggests using a trie (also called a prefix tree): Read the query string, then build a trie of Levenshtein rows, one row per trie node, so that words sharing a prefix also share the rows for that prefix. Afterwards, you can easily traverse it to find strings close to the query string.

(This does mean that the Levenshtein rows have to be recomputed for each new query string, although the trie of dictionary words itself can be built once and reused. I don't think there is a similarly intriguing structure for all-pairs distances.)
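Here is a rough sketch of the trie idea in Python. It is my own illustration of the technique, not the code from the article mentioned below, and the names `TrieNode`, `build_trie`, and `search` are hypothetical. Each step down the trie computes one new Levenshtein row from the parent's row, so words with a common prefix reuse the same rows. Since the minimum of a row never decreases as you go deeper, a branch can be abandoned once every entry in its row exceeds the threshold.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.word = None     # set when a dictionary word ends here


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def search(root, query, max_dist):
    """Return sorted (distance, word) pairs within max_dist of query."""
    first_row = list(range(len(query) + 1))   # row for the empty prefix
    results = []
    if root.word is not None and first_row[-1] <= max_dist:
        results.append((first_row[-1], root.word))
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev_row = stack.pop()
        # Extend the parent's row by one trie character.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,                          # insertion
                           prev_row[j] + 1,                         # deletion
                           prev_row[j - 1] + (query[j - 1] != ch)))  # substitution or match
        if node.word is not None and row[-1] <= max_dist:
            results.append((row[-1], node.word))
        # Prune: if every entry exceeds max_dist, no descendant can match.
        if min(row) <= max_dist:
            stack.extend((child, c, row) for c, child in node.children.items())
    return sorted(results)


trie = build_trie(["right", "light", "tight", "night", "bright", "apple"])
print(search(trie, "right", 1))
```

Note that `build_trie` only needs to run once per dictionary; `search` does the per-query work.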

I thought I recently saw an article about this with a nice Python implementation. Will add a link if I can find it. Edit: Here it is, on Steve Hanov's blog.

answered Sep 22 '22 06:09

us2012
