Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to match strings with typo

I have a huge list of strings (city-names) and I want to find the name of a city even if the user makes a typo.

Example

User types "chcago" and the system finds "Chicago"

Of course I could calculate the Levenshtein distance of the query for all strings in the list but that would be horribly slow.

Is there any efficient way to perform this kind of string-matching?

like image 714
user2033412 Avatar asked Oct 31 '15 13:10

user2033412


2 Answers

I think the basic idea is to use Levenshtein distance, but on a subset of the names. One approach that works if the names are long enough is to use n-grams. You can store n-grams and then use more efficient techniques to say that at least x n-grams need to match. Alas, your example misspelling has 2-matching 3-grams with Chicago out of 5 (unless you count partials at the beginning and end).

For shorter names, another approach is to store the letters in each name. So, "Chicago" would turn into 6 "tuples": "c", "h", "i", "a", "g", "o". You would do the same for the name entered and then require that 4 or 5 match. This is a fairly simple match operation, so it can go quite fast.

Then, on this reduced set, apply Levenshtein distance to determine what the closest match is.

like image 194
Gordon Linoff Avatar answered Oct 12 '22 19:10

Gordon Linoff


You're asking to determine Levenshtein without using Levenshtein.

You would have to determine how far the words could be deviated before it could be identified, and see if it would be acceptable to apply this less accurate algorithm. For instance, you could lookup commonly switched typed letters and limit it to that. Or apply the first/last letter rule from this paper. You could also assume the first few letters are correct and look up the cities in a sorted list and if you don't find it, apply Levenshtein to the n-1 and n+1 words where n is the location of the last lookup (or some variant of it).

There are several ideas, but I don't think there is a single best solution for what you are asking, without more assumptions.

like image 30
ergonaut Avatar answered Oct 12 '22 19:10

ergonaut