Creating a "spell check" that checks against a database with a reasonable runtime

Tags:

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user input against a certain column in a table for all these records and return any matches with a certain hamming distance (again, this question's not about determining hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches are returned.

I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?

The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.

For what it's worth, I'm using NHibernate for data access.

I would appreciate any feedback on how I can do this or what my options are.

432

asked Jan 28 '11 22:01

Tyler Treat

2 Answers

Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as psuedocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference so have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to return all results that start with the provided string easily.

The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.

173

answered Sep 21 '22 11:09

Drew Noakes

As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.

Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.

answered Sep 24 '22 11:09

Mr. Putty

Related questions
                            
                                C# - Are Dynamic Parameters Boxed
                            
                                Dependency-injection in real life
                            
                                Why doesn't Java's hashCode support universal hashing?
                            
                                Cross platform git hooks
                            
                                Any Java libraries for drawing ASCII tables?
                            
                                JavaScript catch parameter already defined
                            
                                Why does Java's Collection<E>.toArray() return an Object[] rather than an E[]?
                            
                                xpath 2.0 for java possible
                            
                                Partially match strings in case of List.contains(String)
                            
                                Why is (1 < NaN) false in JavaScript?
                            
                                How to remove dead code from Javascript [duplicate]
                            
                                How to fire mouse wheel event in Firefox with JavaScript?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With