Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithms for "fuzzy matching" strings

Tags:

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.

like image 684
Alexey Romanov Avatar asked May 23 '10 11:05

Alexey Romanov


People also ask

Which is the best string matching algorithm?

Results: The Boyer-Moore-Horspool algorithm achieves the best overall results when used with medical texts. This algorithm usually performs at least twice as fast as the other algorithms tested. Conclusion: The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used.

How does fuzzy matching algorithm work?

Fuzzy matching is a technique used in computer-assisted translation as a special case of record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous translations.

Which are the string matching algorithms?

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.

Which algorithm is used for matching?

Hashing-string matching algorithms: Rabin Karp Algorithm: It matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.


1 Answers

I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)

I think it would be useful to state the problem and the requirements more clearly.

Problem:

We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.

  1. It is expected that all the letters of the input be in the keyword
  2. It is expected that the letters in the input be in the same order in the keyword
  3. The list of keywords returned should be presented in a consistent (reproductible) order
  4. The algorithm should be case insensitive

Analysis:

The first two requirements can be sum up like such: for an input axg we are looking for words matching this regular expression [^a]*a[^x]*x[^g]*g.*

The third requirement is purposely loose. The order in which the words should appear in the list need being consistent... however it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremy long, then a scoring approach could be better, however for short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.

Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.

There is no precision about handling unicode characters, for example is à similar to a or another character altogether ? Since I know of no language that currently uses such characters in their keywords, I'll let it slip for now.

My solution:

For any input, I would build the regular expression expressed earlier. It suitable for Python because the language already features case-insensitive matching.

I would then match my (alphabetically sorted) list of keywords, and output it so filtered.

In pseudo-code:

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']  def GetList(input, words = WORDS):   expr = ['[^' + i + ']*' + i for i in input]   return [w for w in words if re.match(expr, w, re.IGNORECASE)] 

I could have used a one-liner but thought it would obscure the code ;)

This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:

  • Either there are few characters, thus the matching is quick and the length of the list does not matter much
  • Either there are a lots of characters, and this means we are filtering a short list, thus it does not matter too much if the matching takes a bit longer element-wise

I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.

like image 72
Matthieu M. Avatar answered Sep 30 '22 12:09

Matthieu M.