Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm. However, after hours of research I am really struggling.

I want to create an algorithm that performs a fuzzy search on a list of names of schools.

This is what I have looked at so far:

Most of my research on Google and Stack Overflow keeps pointing to "string metrics" such as:

Levenshtein distance
Damerau-Levenshtein distance
Needleman–Wunsch algorithm

However, this just gives a score of how similar two strings are. The only way I can think of to implement it as a search algorithm is to perform a linear search, executing the string metric for each string and returning the strings whose scores pass a certain threshold. (Originally I had my strings stored in a trie, but this obviously won't help me here!)
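For reference, here is a minimal sketch of that linear scan in Python. The names `levenshtein` and `fuzzy_scan` and the `max_distance` parameter are my own, not from any library. Since Levenshtein is a distance, the threshold is an upper bound rather than a score to exceed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


def fuzzy_scan(query: str, names: list[str], max_distance: int = 2) -> list[str]:
    """Compare the query with every name; keep those within max_distance."""
    q = query.lower()
    scored = [(levenshtein(q, name.lower()), name) for name in names]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


schools = ["Harvard University", "Howard University", "Stanford University"]
print(fuzzy_scan("Harvad University", schools))
```

Every query touches every name, so the cost is roughly (number of names) x (query length) x (name length) character comparisons per search.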

Although this is not such a bad idea for small lists, it would be problematic for lists with, let's say, 100,000 names when the user performs many queries.

Another approach I looked at is the spell-checker method, where you search for all potential misspellings. However, this is also highly inefficient, as it requires more than 75,000 candidate words for a word of length 7 and an error count of just 2.
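To show where that blow-up comes from, here is a rough sketch of the candidate generation, assuming a lowercase a-z alphabet and counting deletions, transpositions, replacements and insertions as single edits. The function names (`edits1`, `edits2`, `lookup`) are hypothetical, and the exact candidate count depends on the alphabet and the word.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: plain a-z input only


def edits1(word: str) -> set[str]:
    """All strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word: str) -> set[str]:
    """All strings reachable by applying a single edit twice."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def lookup(query: str, vocabulary: set[str]) -> set[str]:
    """Known words within one edit of the query, else within two edits."""
    hits = (edits1(query) | {query}) & vocabulary
    return hits or (edits2(query) & vocabulary)


print(len(edits1("schools")), len(edits2("schools")))
print(lookup("shol", {"school", "college", "academy"}))
```

The lookups themselves are cheap set intersections; the cost is in generating the candidate set for every query.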

What I need:

Can someone please suggest a good, efficient fuzzy search algorithm, with:

Name of the algorithm
How it works or a link to how it works
Pros and cons and when it's best used (optional)

I understand that all algorithms will have their pros and cons and there is no best algorithm.

743

asked Sep 01 '15 16:09

Yahya Uddin

1 Answer

Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.

Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.

Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:

smythe -> smithe -> smith

And from "smote" to "smith" in two steps:

smote -> smite -> smith

So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Smythe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.

What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms reduce a word to a short code that approximates how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.
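To make that concrete, here is a small sketch of classic American Soundex in Python. It is my own illustrative implementation (the name `soundex` and the handling of non-letters are assumptions), not a library API: keep the first letter, map the remaining consonants to digits, skip vowels, and pad or truncate to four characters.

```python
# Digit classes used by American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit is kept
    return (code + "000")[:4]


print(soundex("Smith"), soundex("Smythe"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Note that Soundex is quite coarse: it also encodes "Smote" as S530. Separating that from "Smith" takes a finer algorithm such as Metaphone, which treats the "th" sound differently from a plain "t".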

Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote," can be laughably bad.

Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
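The build-once, look-up-many idea could be sketched like this in Python. Everything here is hypothetical naming (`PhoneticIndex`, `crude_key`), and `crude_key` is only a stand-in for a real phonetic algorithm: it keeps the first letter, drops vowels plus H, W and Y, and collapses repeated letters. In practice you would pass in Soundex, Metaphone or similar as `key_fn`.

```python
from collections import defaultdict
from typing import Callable, Iterable


def crude_key(word: str) -> str:
    """Stand-in phonetic key (NOT Soundex or Metaphone), for illustration."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch in "AEIOUHWY":
            continue          # ignore vowels and weak consonants
        if ch != key[-1]:
            key += ch         # collapse doubled letters
    return key


class PhoneticIndex:
    """Maps the phonetic key of each word to the names containing it."""

    def __init__(self, names: Iterable[str],
                 key_fn: Callable[[str], str] = crude_key):
        self.key_fn = key_fn
        self.by_key: dict[str, set[str]] = defaultdict(set)
        for name in names:                      # one-time startup cost
            for word in name.split():
                self.by_key[key_fn(word)].add(name)

    def search(self, query: str) -> list[str]:
        """Names containing a phonetic match for every word of the query."""
        result = None
        for word in query.split():
            matches = self.by_key.get(self.key_fn(word), set())
            result = matches if result is None else result & matches
        return sorted(result or [])


index = PhoneticIndex(["Smith College", "Smythe Academy",
                       "Boston College", "Berklee College of Music"])
print(index.search("Smyth Colege"))   # ['Smith College']
```

Each query costs one hash lookup per word instead of one distance computation per name. Indexing word by word is a design choice for multi-word school names; a single key per full name would also work if users always type the whole name.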

Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not the misspelling, when pronounced, sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.

114

answered Sep 22 '22 15:09

Jim Mischel

Tags: string, algorithm, search, levenshtein-distance, fuzzy-search
