I have a dictionary of 50K to 100K strings (up to 50+ characters each) and I am trying to find whether a given string is in the dictionary within some "edit" distance tolerance (Levenshtein, for example). I am fine pre-computing any type of data structure before doing the search.
My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would also be fine with just a boolean that says whether a given string is in the dictionary or not, if there were a significantly faster algorithm for that.
For this, I first tried computing all the Levenshtein distances and taking the minimum, and it was obviously horribly slow. So I then implemented a Levenshtein trie based on this article: http://stevehanov.ca/blog/index.php?id=114
See my gist here for reproducing the benchmark: https://gist.github.com/nicolasmeunier/7493947
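For reference, the naive first attempt (compute every distance, take the minimum) can be sketched in pure Ruby like this; `levenshtein` and `nearest_neighbor` are my illustrative names, not the gist's:

```ruby
# Standard dynamic-programming Levenshtein distance, one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Naive linear scan: O(dictionary size) distance computations per query,
# which is why it is horribly slow for 50K-100K entries.
def nearest_neighbor(dictionary, query)
  dictionary.min_by { |word| levenshtein(query, word) }
end
```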
Here are a few benchmarks I got on my machine:
Edit distance of 0 (perfect match)
Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801>
Edit distance of 2 (it becomes a LOT slower)
Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>
And it goes downhill from there, becoming extremely slow for edit distances larger than 2 (1+ second on average per tested string).
I would like to know how/if I could speed this up significantly. If there are existing solutions already implemented in Ruby or as gems, I also don't want to reinvent the wheel...
EDIT 1: In my case, I expect most of the strings I am matching against the dictionary NOT to be in there. So if there is any algorithm to quickly discard a string, that could really help.
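On the quick-discard point: one standard cheap rejection test (a sketch under my own naming, not something from the question) uses the fact that the edit distance between two strings is at least the difference of their lengths, so the dictionary can be bucketed by length once up front and most entries skipped for free:

```ruby
# Group the dictionary by string length once, before any queries.
def build_length_index(dictionary)
  dictionary.group_by(&:length)
end

# Only words whose length is within k of the query's length can possibly
# be within edit distance k; everything else is discarded without
# computing a single distance.
def length_candidates(length_index, query, k)
  ((query.length - k)..(query.length + k)).flat_map { |len| length_index[len] || [] }
end
```

This does not replace the distance computation, but it shrinks the candidate set before the expensive step runs.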
Thanks, Nicolas
I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.
Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes: first use the trigram index to fetch a small set of candidates, then filter those candidates by exact edit distance.
In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:
MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}
This performs a handful of well-optimized database queries and a few Levenshtein distance computations in Ruby.
Caveats:
Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".
Basic idea: split strings into trigrams, and search for the maximal intersection scores.
For example, suppose we have the string "hello world". This string generates the trigrams "hel", "ell", "llo", "lo ", "o w", and so on. It also produces special prefix/suffix trigrams for each word, like "$he", "$wo", "lo$", "ld$".
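One plausible reading of that trigram split, sketched in Ruby (my own simplification, not the original C source; it assumes words of at least two characters):

```ruby
# Generate the plain trigrams of the whole string (including ones that
# span the space, like "lo " and "o w"), plus "$"-marked prefix/suffix
# trigrams for each word, like "$he" and "ld$".
def trigrams(str)
  s = str.downcase
  plain = (0..s.length - 3).map { |i| s[i, 3] }
  boundary = s.split(/\s+/).flat_map { |w| ["$#{w[0, 2]}", "#{w[-2, 2]}$"] }
  (plain + boundary).uniq
end
```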
Thereafter, an index is built for each trigram, recording the terms in which it appears.
So, there is a list of term IDs for each trigram.
When the user submits a query string, it is also split into trigrams; the program searches for the maximal intersection scores and generates an N-sized result list.
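The index-and-score step above can be sketched as follows (again my own simplified Ruby, using plain overlap counts rather than the original's weighted scores; all names are illustrative):

```ruby
# Trigrams of a single term, padded with "$" boundary markers.
def tri(s)
  padded = "$#{s.downcase}$"
  (0..padded.length - 3).map { |i| padded[i, 3] }.uniq
end

# Inverted index: each trigram maps to the IDs of terms containing it.
def build_index(terms)
  index = Hash.new { |h, k| h[k] = [] }
  terms.each_with_index { |t, id| tri(t).each { |g| index[g] << id } }
  index
end

# Score each candidate by how many query trigrams it shares, then
# return the n best as [term, score] pairs.
def top_n(index, terms, query, n)
  scores = Hash.new(0)
  tri(query).each { |g| index.fetch(g, []).each { |id| scores[id] += 1 } }
  scores.sort_by { |_, s| -s }.first(n).map { |id, s| [terms[id], s] }
end
```

Terms sharing no trigram with the query never enter the score table, which is what makes the lookup fast even for large dictionaries.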
It works quickly: I remember that on an old Sun/Solaris box (256 MB RAM, 200 MHz CPU) it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz
UPDATE:
I created a new archive with the Makefile adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz
Make a directory and extract the archive there:
mkdir F2
cd F2
tar xvfz wilbur-khovayko.tgz
make
Go to the test directory, copy in your term list file (the name is fixed: termlist.txt), and build the index:
cd test/
cp /tmp/test/termlist.txt ./termlist.txt
./crefdb.exe <termlist.txt
In this test, I used ~380,000 expired domain names:
wc -l termlist.txt
379430 termlist.txt
Run the findtest application:
./findtest.exe
boking <-- this is the query: the word "booking" with a misspelling
0001:Query: [boking]
1: 287890 ( 3.863739) [bokintheusa.com,2009-11-20,$69]
2: 287906 ( 3.569148) [bookingseu.com,2009-11-20,$69]
3: 257170 ( 3.565942) [bokitko.com,2009-11-18,$69]
4: 302830 ( 3.413791) [bookingcenters.com,2009-11-21,$69]
5: 274658 ( 3.408325) [bookingsadept.com,2009-11-19,$69]
6: 100438 ( 3.379371) [bookingresorts.com,2009-11-09,$69]
7: 203401 ( 3.363858) [bookinginternet.com,2009-11-15,$69]
8: 221222 ( 3.361689) [bobokiosk.com,2009-11-16,$69]
. . . .
97: 29035 ( 2.169753) [buccupbooking.com,2009-11-05,$69]
98: 185692 ( 2.169047) [box-hosting.net,2009-11-14,$69]
99: 345394 ( 2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
100: 150134 ( 2.167372) [bowlingbrain.com,2009-11-12,$69]