 

Fast fuzzy/approximate search in dictionary of strings in Ruby

I have a dictionary of 50K to 100K strings (which can be up to 50+ characters long) and I am trying to find whether a given string is in the dictionary within some "edit" distance tolerance (Levenshtein, for example). I am fine with pre-computing any type of data structure before doing the search.

My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would be fine with just getting a boolean that says whether a given string is in the dictionary or not, if there were a significantly faster algorithm to do so.

For this, I first tried computing all the Levenshtein distances and taking the minimum, which was obviously horribly slow. So I tried to implement a Levenshtein trie based on this article: http://stevehanov.ca/blog/index.php?id=114
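For reference, the naive version amounts to something like this (a sketch assuming the text gem; closest_word and dictionary are just illustrative names):

require 'text'  # gem providing Text::Levenshtein

# Naive baseline: one Levenshtein computation per dictionary entry.
def closest_word(word, dictionary)
  dictionary.min_by { |entry| Text::Levenshtein.distance(word, entry) }
end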

See my gist here for reproducing the benchmark: https://gist.github.com/nicolasmeunier/7493947

Here are a few benchmarks I got on my machine:

Edit distance of 0 (perfect match)

Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801> 

Edit distance of 2 (it becomes a LOT slower)

Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>

And it goes downhill from there, becoming extremely slow for edit distances larger than 2 (1+ second on average per tested string).

I would like to know how/if I could speed this up significantly. If there are existing solutions already implemented in Ruby gems, I also don't want to reinvent the wheel...

EDIT 1: In my case, I expect most of the strings I am matching against the dictionary NOT to be in there. So if there is any algorithm to quickly discard a string, that could really help.
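One trivial discard step I can think of, sketched below with illustrative names: two strings can never differ in length by more than their edit distance, so entries outside the length tolerance can be skipped without computing any distance. But I suspect there is something much better.

# Length prefilter: |a.length - b.length| <= Levenshtein(a, b), so any
# entry outside the length tolerance cannot be within the edit distance.
def candidates(word, dictionary, tolerance)
  dictionary.select { |entry| (entry.length - word.length).abs <= tolerance }
end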

Thanks, Nicolas

asked Nov 16 '13 by Nicolas M.


2 Answers

I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

  • first ask either gem for a set of best matches trigram-wise,
  • then compare the results with your input string using Levenshtein,
  • and return the min for that measure.

In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:

# find_by_fuzzy_name does the indexed trigram lookup; Levenshtein then re-ranks.
MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
}
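If you want just the single closest neighbor rather than all records under the threshold, a sketch of the same idea:

# Return the one record whose name minimizes the edit distance.
MyRecords.find_by_fuzzy_name(input_string).min_by { |result|
  Text::Levenshtein.distance(input_string, result.name)
}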

This performs a handful of well-optimized database queries and a few Levenshtein distance computations in Ruby.

Caveats:

  • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshteins.
  • using trigrams assumes your input text is Latin text or close to it (European languages, basically).
  • there probably are edge cases, since nothing guarantees that "number of matching trigrams" is a great general approximation of "edit distance".
answered Oct 15 '22 by mezis


Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".

Basic idea: split strings into trigrams, and search for the maximal intersection scores.

For example, take the string "hello world". It generates the trigrams hel, ell, llo, "lo ", "o_w", and so on. It also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.

After that, an index is built for each trigram, recording the terms in which it is present.

So the index is a list of term IDs for each trigram.

When the user submits a string, it is also split into trigrams, and the program searches for the maximal intersection score, generating a list of the N best terms.
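My implementation is in C, but since you are working in Ruby, here is a simplified sketch of the same idea (per-word trigrams only, skipping the cross-word trigrams; all names are illustrative):

# Pad each word with $ so the special prefix/suffix trigrams are produced.
def trigrams(word)
  padded = "$#{word}$"
  (0..padded.length - 3).map { |i| padded[i, 3] }
end

# Index: trigram => list of IDs of the terms containing it.
def build_index(terms)
  index = Hash.new { |h, k| h[k] = [] }
  terms.each_with_index do |term, id|
    term.split.each { |w| trigrams(w).each { |t| index[t] << id } }
  end
  index
end

# Score each term by how many trigrams it shares with the query; keep top N.
def top_n(query, index, terms, n)
  scores = Hash.new(0)
  query.split.each { |w| trigrams(w).each { |t| index[t].each { |id| scores[id] += 1 } } }
  scores.max_by(n) { |_, score| score }.map { |id, score| [terms[id], score] }
end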

It works quickly: I remember that on an old Sun/Solaris machine (256 MB RAM, 200 MHz CPU), it searched for the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.

You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz

UPDATE:

I created a new archive with the Makefile adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz

Make a directory and extract the archive there:

mkdir F2
cd F2
tar xvfz wilbur-khovayko.tgz
make

Go to the test directory, copy in your term list file (the name termlist.txt is fixed), and build the index:

 cd test/
 cp /tmp/test/termlist.txt ./termlist.txt
 ./crefdb.exe <termlist.txt

In this test, I used ~380,000 expired domain names:

wc -l termlist.txt
379430 termlist.txt

Run the findtest application:

./findtest.exe

boking  <-- this is the query: the word "booking", misspelled


0001:Query: [boking]
  1:  287890 (  3.863739) [bokintheusa.com,2009-11-20,$69]
  2:  287906 (  3.569148) [bookingseu.com,2009-11-20,$69]
  3:  257170 (  3.565942) [bokitko.com,2009-11-18,$69]
  4:  302830 (  3.413791) [bookingcenters.com,2009-11-21,$69]
  5:  274658 (  3.408325) [bookingsadept.com,2009-11-19,$69]
  6:  100438 (  3.379371) [bookingresorts.com,2009-11-09,$69]
  7:  203401 (  3.363858) [bookinginternet.com,2009-11-15,$69]
  8:  221222 (  3.361689) [bobokiosk.com,2009-11-16,$69]
  . . . . 
 97:   29035 (  2.169753) [buccupbooking.com,2009-11-05,$69]
 98:  185692 (  2.169047) [box-hosting.net,2009-11-14,$69]
 99:  345394 (  2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
100:  150134 (  2.167372) [bowlingbrain.com,2009-11-12,$69]
answered Oct 15 '22 by olegarch