
Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.

For example, if my query is elepant, the result would most likely be elephant.

If my query is fentist, the result would probably be dentist.

This assumes, of course, that both elephant and dentist are present in my initial word list.

What kind of index, data structure, or algorithm can I use so that the query is fast? Ideally each query would run in O(log N) time, where N is the number of words in the list.

What I have: The most naive approach is to write a "distance function" (which computes the "distance" between two words, in terms of how different they are), compare the query with every word in the list in O(N) distance computations, and return the word with the smallest distance. I would rather not use this because it is too slow for a list this large.
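For reference, the naive baseline described above can be sketched as follows. This is a minimal illustration, assuming Levenshtein (edit) distance as the "distance function"; the question does not name a specific one, and the function names `levenshtein` and `closest_word` are made up for this example.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_word(query: str, words: Iterable[str]) -> Tuple[str, int]:
    """Linear scan: one distance computation per word in the list."""
    return min(((w, levenshtein(query, w)) for w in words), key=lambda p: p[1])


if __name__ == "__main__":
    word_list = ["dentist", "elephant", "element"]
    print(closest_word("elepant", word_list))
    print(closest_word("fentist", word_list))
```

Every query touches every word, which is the cost the question wants to avoid.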

Chris Vilches asked Dec 27 '18 19:12





1 Answer

The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.

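If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, or the BK-tree. These trees take advantage of the triangle inequality to skip subtrees that cannot contain a closer match, so the distance function has to be a true metric (Levenshtein distance is one).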
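To make the triangle-inequality pruning concrete, here is a minimal BK-tree sketch. It assumes Levenshtein distance as the metric, and the class and method names (`BKTree`, `add`, `nearest`) are invented for this illustration rather than taken from any library.

```python
from typing import Callable, Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance via a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is [word, children]; children maps an integer distance to a child node."""

    def __init__(self, distance: Callable[[str, str], int]) -> None:
        self.distance = distance
        self.root: Optional[list] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            children: Dict[int, list] = node[1]
            if d not in children:
                children[d] = [word, {}]
                return
            node = children[d]

    def nearest(self, query: str) -> Tuple[str, int]:
        """Exact nearest neighbour as a (word, distance) pair."""
        if self.root is None:
            raise ValueError("tree is empty")
        best_word, best_d = self.root[0], self.distance(query, self.root[0])
        stack: List[list] = [self.root]
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d < best_d:
                best_word, best_d = word, d
            for edge, child in children.items():
                # Every word w in this subtree has distance(word, w) == edge, so by
                # the triangle inequality distance(query, w) >= |d - edge|.
                if abs(d - edge) < best_d:
                    stack.append(child)
        return best_word, best_d


if __name__ == "__main__":
    tree = BKTree(levenshtein)
    for w in ["elephant", "dentist", "element", "dentin", "telephone"]:
        tree.add(w)
    print(tree.nearest("elepant"))
    print(tree.nearest("fentist"))
```

Note that a BK-tree does not guarantee O(log N) queries. How many subtrees get pruned depends on how close the best match is, so it tends to work best when the query is only a few edits away from some word in the list.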

If you're willing to accept an approximate solution, there are much faster algorithms. At the time of writing, the state of the art for approximate methods is Hierarchical Navigable Small World graphs (HNSW). The Non-Metric Space Library (NMSLIB) provides an efficient implementation of HNSW as well as several other approximate NNS methods.

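(You can compute the Levenshtein distance with the standard Wagner-Fischer dynamic program. Hirschberg's algorithm is a refinement of it that also recovers the optimal alignment while using only linear space.)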

Joshua answered Sep 29 '22 05:09
