So I'm currently working for with using SecondString for fuzzy string matching, where I have a large dictionary to compare to (with each entry in the dictionary has an associated non-unique identifier). I am currently using a hashMap to store this dictionary.
When I want to do fuzzy string matching, I first check to see if the string is in the hashMap and then I iterate through all of the other potential keys, calculating the string similarity and storing the k,v pair/s with the highest similarity. Depending on which dictionary I am using this can take a long time ( 12330 - 1800035 entries ). Is there any way to speed this up or make it faster? I am currently writing a memoization function/table as a way of speeding this up, but can anyone else think of a better way to improve the speed of this? Maybe a different structure or something else I'm missing.
Many thanks in advance,
Nathan
DataMatch Enterprise is a highly visual and intuitive fuzzy matching tool, that automates the entire fuzzy matching process, relieving you of the manual effort and labor required to match data fields.
Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.
Fuzzy String Matching, also known as Approximate String Matching, is the process of finding strings that approximately match a pattern. The process has various applications such as spell-checking, DNA analysis and detection, spam detection, plagiarism detection e.t.c. Introduction to Fuzzywuzzy in Python.
In computer science, fuzzy string matching is the technique of finding strings that match a pattern approximately (rather than exactly). In another word, fuzzy string matching is a type of search that will find matches even when users misspell words or enter only partial words for the search.
What your looking for is a BKTree (BK-Tree) combined with the Levenshtein Distance algorithm. The lookup performance in a BKtree depends on how "Fuzzy" your search is. Where fuzzy is defined as the number of distance (edits) between the search word and the matches.
Here is a good blog on the subject: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
Some notes on the performance: http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html
Notes on the http://en.wikipedia.org/wiki/Levenshtein_distance algorithm.
Also, here is a BK-Tree written in Java. Should give you an idea of the interface: http://code.google.com/p/java-bk-tree/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With