Improving performance of fuzzy string matching against a dictionary

Tags: java, data-structures

I'm currently using SecondString for fuzzy string matching against a large dictionary, where each entry has an associated non-unique identifier. The dictionary is stored in a HashMap.

When I want to do fuzzy string matching, I first check whether the string is in the HashMap and then iterate through all of the other keys, calculating the string similarity and storing the key/value pair(s) with the highest similarity. Depending on which dictionary I am using (12,330 to 1,800,035 entries), this can take a long time. Is there any way to speed this up? I am currently writing a memoization table, but can anyone think of a better approach, such as a different data structure or something else I'm missing?
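For reference, the lookup described above can be sketched roughly as follows. This is only an illustration: the class name `LinearFuzzyLookup`, the method `bestMatches`, and the `similarity` parameter are hypothetical stand-ins (the real code would call a SecondString distance instead), and the sketch assumes an exact hit returns immediately.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical baseline: exact lookup first, then a full scan of every key. */
final class LinearFuzzyLookup {

    /** Returns the entries whose key scores highest against the query (ties are all kept). */
    static <V> List<Map.Entry<String, V>> bestMatches(
            Map<String, V> dictionary,
            String query,
            ToDoubleBiFunction<String, String> similarity) { // stand-in for the real similarity measure
        List<Map.Entry<String, V>> best = new ArrayList<>();

        // Fast path (assumption): an exact key match short-circuits the scan.
        if (dictionary.containsKey(query)) {
            best.add(new AbstractMap.SimpleImmutableEntry<>(query, dictionary.get(query)));
            return best;
        }

        // Slow path: one similarity computation per dictionary entry, i.e. O(n) per query.
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, V> entry : dictionary.entrySet()) {
            double score = similarity.applyAsDouble(query, entry.getKey());
            if (score > bestScore) {
                bestScore = score;
                best.clear();
            }
            if (score == bestScore) {
                best.add(entry);
            }
        }
        return best;
    }
}
```

Every query that misses the exact lookup pays for one similarity computation per entry, which is where the time goes on the 1.8-million-entry dictionary.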

Many thanks in advance,

Nathan

asked Feb 09 '11 13:02

Nathan Harmston

1 Answer

What you're looking for is a BK-tree combined with the Levenshtein distance algorithm. Lookup performance in a BK-tree depends on how "fuzzy" your search is, where fuzziness is the maximum edit distance allowed between the search word and its matches. The smaller that distance, the more of the tree can be skipped.
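To make the idea concrete, here is a minimal sketch of a BK-tree keyed on Levenshtein distance. It is not the API of the Java library linked in this answer; the class name `BKTree` and the methods `add` and `search` are names chosen for illustration. Each child edge is labelled with the distance between parent and child, and the triangle inequality lets a search skip every subtree whose label falls outside `[d - maxDistance, d + maxDistance]`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative BK-tree over strings, using Levenshtein distance as the metric. */
final class BKTree {

    private static final class Node {
        final String term;
        // Edge label = Levenshtein distance between this node's term and the child's term.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Inserts a term by walking down the edge whose label equals the distance at each node. */
    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(term, node.term);
            if (d == 0) return; // term already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDistance edits of the query. */
    List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root == null) return matches;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = levenshtein(query, node.term);
            if (d <= maxDistance) matches.add(node.term);
            // Triangle inequality: only these edge labels can lead to a match.
            for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
                Node child = node.children.get(k);
                if (child != null) stack.push(child);
            }
        }
        return matches;
    }
}
```

For the dictionary in the question, one option would be to build the tree once from the HashMap's keys and keep the HashMap for looking up the identifiers of whatever `search` returns. Note that this assumes an edit-distance metric; a SecondString similarity that does not satisfy the triangle inequality would not prune correctly.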

Here is a good blog on the subject: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

Some notes on the performance: http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html

Notes on the Levenshtein distance algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

Also, here is a BK-tree written in Java, which should give you an idea of the interface: http://code.google.com/p/java-bk-tree/

answered Nov 15 '22 17:11

eSniff
