Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving performance of fuzzy string matching against a dictionary

So I'm currently working for with using SecondString for fuzzy string matching, where I have a large dictionary to compare to (with each entry in the dictionary has an associated non-unique identifier). I am currently using a hashMap to store this dictionary.

When I want to do fuzzy string matching, I first check to see if the string is in the hashMap and then I iterate through all of the other potential keys, calculating the string similarity and storing the k,v pair/s with the highest similarity. Depending on which dictionary I am using this can take a long time ( 12330 - 1800035 entries ). Is there any way to speed this up or make it faster? I am currently writing a memoization function/table as a way of speeding this up, but can anyone else think of a better way to improve the speed of this? Maybe a different structure or something else I'm missing.

Many thanks in advance,

Nathan

like image 715
Nathan Harmston Avatar asked Feb 09 '11 13:02

Nathan Harmston


People also ask

What tool would you use to adjust the sensitivity of the fuzzy matching?

DataMatch Enterprise is a highly visual and intuitive fuzzy matching tool, that automates the entire fuzzy matching process, relieving you of the manual effort and labor required to match data fields.

Is fuzzy matching good?

Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.

What is fuzzy matching in python?

Fuzzy String Matching, also known as Approximate String Matching, is the process of finding strings that approximately match a pattern. The process has various applications such as spell-checking, DNA analysis and detection, spam detection, plagiarism detection e.t.c. Introduction to Fuzzywuzzy in Python.

How does the fuzzy matching work?

In computer science, fuzzy string matching is the technique of finding strings that match a pattern approximately (rather than exactly). In another word, fuzzy string matching is a type of search that will find matches even when users misspell words or enter only partial words for the search.


1 Answers

What your looking for is a BKTree (BK-Tree) combined with the Levenshtein Distance algorithm. The lookup performance in a BKtree depends on how "Fuzzy" your search is. Where fuzzy is defined as the number of distance (edits) between the search word and the matches.

Here is a good blog on the subject: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

Some notes on the performance: http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html

Notes on the http://en.wikipedia.org/wiki/Levenshtein_distance algorithm.

Also, here is a BK-Tree written in Java. Should give you an idea of the interface: http://code.google.com/p/java-bk-tree/

like image 184
eSniff Avatar answered Nov 15 '22 17:11

eSniff