I have a dictionary of 50K to 100K strings (up to 50+ characters each) and I am trying to find whether a given string is in the dictionary within some "edit" distance tolerance (Levenshtein, for example). I am fine pre-computing any type of data structure before doing the search.
My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would also be fine with just a boolean that says whether a given string is in the dictionary or not, if there were a significantly faster algorithm for that.
For this, I first tried computing all the Levenshtein distances and taking the minimum, and it was obviously horribly slow. So I then implemented a Levenshtein trie based on this article: http://stevehanov.ca/blog/index.php?id=114
See my gist here for reproducing the benchmark: https://gist.github.com/nicolasmeunier/7493947
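For reference, the naive first attempt (compute every distance, take the minimum) can be sketched in pure Ruby like this; `levenshtein` and `nearest_neighbor` are my illustrative names, not the gist's:

```ruby
# Standard dynamic-programming Levenshtein distance, one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Naive linear scan: O(dictionary size) distance computations per query,
# which is why it is horribly slow for 50K-100K entries.
def nearest_neighbor(dictionary, query)
  dictionary.min_by { |word| levenshtein(query, word) }
end
```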
Here are a few benchmarks I got on my machine:
Edit distance of 0 (perfect match)
Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801>
Edit distance of 2 (it becomes a LOT slower)
Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>
And it goes downhill from there, becoming extremely slow for edit distances larger than 2 (1+ second on average per tested string).
I would like to know how/if I could speed this up significantly. If there are existing solutions already implemented in Ruby or as gems, I also don't want to reinvent the wheel...
EDIT 1: In my case, I expect most of the strings I am matching against the dictionary NOT to be in there. So if there is any algorithm to quickly discard a string, that could really help.
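On the quick-discard point: one standard cheap rejection test (a sketch under my own naming, not something from the question) uses the fact that the edit distance between two strings is at least the difference of their lengths, so the dictionary can be bucketed by length once up front and most entries skipped for free:

```ruby
# Group the dictionary by string length once, before any queries.
def build_length_index(dictionary)
  dictionary.group_by(&:length)
end

# Only words whose length is within k of the query's length can possibly
# be within edit distance k; everything else is discarded without
# computing a single distance.
def length_candidates(length_index, query, k)
  ((query.length - k)..(query.length + k)).flat_map { |len| length_index[len] || [] }
end
```

This does not replace the distance computation, but it shrinks the candidate set before the expensive step runs.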
Thanks, Nicolas
I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.
Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes: first use the trigram index to fetch a small set of candidates, then filter those candidates by exact edit distance.
In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:
MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}
This performs a handful of well-optimized database queries and a few Levenshtein distance computations in Ruby.
Caveats:
Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".
Basic idea: split strings into trigrams, and search for the maximal intersection scores.
For example, suppose we have the string "hello world". This string generates the trigrams "hel", "ell", "llo", "lo ", "o w", and so on. It also produces special prefix/suffix trigrams for each word, like "$he", "$wo", "lo$", "ld$".
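One plausible reading of that trigram split, sketched in Ruby (my own simplification, not the original C source; it assumes words of at least two characters):

```ruby
# Generate the plain trigrams of the whole string (including ones that
# span the space, like "lo " and "o w"), plus "$"-marked prefix/suffix
# trigrams for each word, like "$he" and "ld$".
def trigrams(str)
  s = str.downcase
  plain = (0..s.length - 3).map { |i| s[i, 3] }
  boundary = s.split(/\s+/).flat_map { |w| ["$#{w[0, 2]}", "#{w[-2, 2]}$"] }
  (plain + boundary).uniq
end
```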
Thereafter, an index is built for each trigram, recording the terms in which it appears.
So, there is a list of term IDs for each trigram.
When the user submits a query string, it is also split into trigrams; the program searches for the maximal intersection scores and generates an N-sized result list.
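The index-and-score step above can be sketched as follows (again my own simplified Ruby, using plain overlap counts rather than the original's weighted scores; all names are illustrative):

```ruby
# Trigrams of a single term, padded with "$" boundary markers.
def tri(s)
  padded = "$#{s.downcase}$"
  (0..padded.length - 3).map { |i| padded[i, 3] }.uniq
end

# Inverted index: each trigram maps to the IDs of terms containing it.
def build_index(terms)
  index = Hash.new { |h, k| h[k] = [] }
  terms.each_with_index { |t, id| tri(t).each { |g| index[g] << id } }
  index
end

# Score each candidate by how many query trigrams it shares, then
# return the n best as [term, score] pairs.
def top_n(index, terms, query, n)
  scores = Hash.new(0)
  tri(query).each { |g| index.fetch(g, []).each { |id| scores[id] += 1 } }
  scores.sort_by { |_, s| -s }.first(n).map { |id, s| [terms[id], s] }
end
```

Terms sharing no trigram with the query never enter the score table, which is what makes the lookup fast even for large dictionaries.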
It works quickly: I remember that on an old Sun/Solaris box (256 MB RAM, 200 MHz CPU) it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz
UPDATE:
I created a new archive with the Makefile adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz
Make a directory and extract the archive there:
mkdir F2
cd F2
tar xvfz wilbur-khovayko.tgz
make
Go to the test directory, copy in your term list file (the name is fixed: termlist.txt), and build the index:
cd test/
cp /tmp/test/termlist.txt ./termlist.txt
./crefdb.exe <termlist.txt
In this test, I used ~380,000 expired domain names:
wc -l termlist.txt
379430 termlist.txt
Run the findtest application:
./findtest.exe
boking <-- this is the query: the word "booking" with a misspelling
0001:Query: [boking]
1: 287890 ( 3.863739) [bokintheusa.com,2009-11-20,$69]
2: 287906 ( 3.569148) [bookingseu.com,2009-11-20,$69]
3: 257170 ( 3.565942) [bokitko.com,2009-11-18,$69]
4: 302830 ( 3.413791) [bookingcenters.com,2009-11-21,$69]
5: 274658 ( 3.408325) [bookingsadept.com,2009-11-19,$69]
6: 100438 ( 3.379371) [bookingresorts.com,2009-11-09,$69]
7: 203401 ( 3.363858) [bookinginternet.com,2009-11-15,$69]
8: 221222 ( 3.361689) [bobokiosk.com,2009-11-16,$69]
. . . .
97: 29035 ( 2.169753) [buccupbooking.com,2009-11-05,$69]
98: 185692 ( 2.169047) [box-hosting.net,2009-11-14,$69]
99: 345394 ( 2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
100: 150134 ( 2.167372) [bowlingbrain.com,2009-11-12,$69]