Finding the most similar string among a set of millions of strings

Tags: string, algorithm, search, data-structures

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.

For example, if my query is "elepant", the result would most likely be "elephant".

If my query is "fentist", the result would probably be "dentist".

This assumes, of course, that both "elephant" and "dentist" are present in my initial word list.

What kind of index, data structure, or algorithm can I use to make the query fast? Ideally the query complexity would be O(log N).

What I have: the most naive approach is to define a "distance function" (which measures how different two words are), compare the query against every word in the list in O(N), and return the word with the smallest distance. I wouldn't use this because it's too slow.
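For reference, the naive baseline described above might look like the following minimal Python sketch. It assumes Levenshtein (edit) distance as the distance function, which the question does not specify; `levenshtein`, `closest_word`, and the sample word list are illustrative names only.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the two-row Wagner-Fischer dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def closest_word(query: str, words: Iterable[str]) -> str:
    """Linear scan: one distance computation per dictionary word."""
    return min(words, key=lambda w: levenshtein(query, w))


# Hypothetical sample dictionary, for illustration only.
words = ["elephant", "dentist", "element", "density"]
print(closest_word("elepant", words))  # elephant
print(closest_word("fentist", words))  # dentist
```

Each query costs one distance computation per word, and each distance computation is itself quadratic in word length, which is why this does not scale to millions of words.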

asked Dec 27 '18 19:12

Chris Vilches

1 Answer

The problem you're describing is a nearest neighbor search (NNS). There are two main families of methods for solving NNS problems: exact and approximate.

If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, or the BK-tree. These trees use the triangle inequality to prune parts of the search space that cannot contain a closer match, which speeds up the search.
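As an illustration of how a metric tree uses the triangle inequality, here is a minimal BK-tree sketch in Python. It assumes Levenshtein distance as the metric; the `BKTree` class, its methods, and the sample words are hypothetical names for this example, not a library API.

```python
def levenshtein(a, b):
    """Edit distance via the two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Metric tree for an integer-valued metric.

    Each node is a pair (word, children), where children maps a distance k
    to the subtree of words that are exactly k away from this node's word.
    """

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def nearest(self, query):
        """Return (distance, word) for an exact nearest neighbour."""
        if self.root is None:
            raise ValueError("empty tree")
        best = (float("inf"), None)
        stack = [self.root]
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d < best[0]:
                best = (d, word)
            # Triangle inequality: every word w in the subtree labelled k
            # satisfies distance(query, w) >= |k - d|, so a subtree can only
            # improve on the current best if |k - d| < best distance.
            for k, child in children.items():
                if abs(k - d) < best[0]:
                    stack.append(child)
        return best


# Hypothetical sample dictionary, for illustration only.
tree = BKTree(levenshtein)
for w in ["elephant", "dentist", "element", "density"]:
    tree.add(w)

print(tree.nearest("elepant"))  # (1, 'elephant')
print(tree.nearest("fentist"))  # (1, 'dentist')
```

The result is exact, but the amount of pruning depends on the data and on how close the best match is, so a logarithmic query time is not guaranteed.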

If you're willing to accept an approximate solution, there are much faster algorithms. One of the leading approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (NMSLIB) provides an efficient implementation of HNSW as well as several other approximate NNS methods.

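(You can compute the Levenshtein distance with the standard Wagner-Fischer dynamic programming algorithm; Hirschberg's algorithm additionally recovers the optimal alignment in linear space.)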

answered Sep 29 '22 05:09

Joshua
