
Google fuzzy search (a.k.a. "suggestions"): What technique(s) are in use?

I'm implementing search suggestion functionality in my web-app, and have been looking at existing implementations for techniques in use.

It seems as though most of the major sites (Amazon, Bing, etc.) implement fuzzy search in the following way:

Tokenize search string into terms
processingSearchStringSet = allIndexedStrings
For each term
    if exact term is NOT in index
        Get possible terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        termMatches = union of stringsIndexedByTerm(t) for each t in fuzzyTerms present in index
        processingSearchStringSet = processingSearchStringSet.intersect(termMatches)
    else
        processingSearchStringSet = processingSearchStringSet.intersect(stringsIndexedByTerm(term))

The result-set members are then presumably ranked by metrics (e.g., term-order preservation, absolute term position, search popularity) and trimmed to a predetermined result-set size before being delivered back to the user.
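For concreteness, here is a minimal runnable sketch of the scheme above in Python. The inverted index (term -> set of indexed strings), the popularity scores, and the distance threshold are illustrative assumptions, not anything these sites are confirmed to use:

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def suggest(query, index, popularity, max_dist=1, limit=10):
        candidates = None  # None means "all strings" until the first term is processed
        for term in query.lower().split():
            if term in index:
                matches = index[term]
            else:
                # union the postings of every indexed term within max_dist edits
                fuzzy = [t for t in index if levenshtein(t, term) <= max_dist]
                matches = set().union(*(index[t] for t in fuzzy))
            candidates = matches if candidates is None else candidates & matches
        # rank the surviving strings (here by popularity alone) and trim to `limit`
        return sorted(candidates or (), key=lambda s: -popularity.get(s, 0))[:limit]

With max_dist=1 this reproduces the Amazon-style behaviour described above: a one-edit typo like "fiftt" still reaches "fifty", while "fiftyyyy" (three edits away) matches nothing and empties the candidate set.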

Google's implementation, on the other hand, differs quite a bit from this.

Specifically, it allows more than one error in the search string's constituent terms. The error threshold seems to depend on where the term of interest sits in the string, though it never exceeds 7.

What's interesting is that:

  1. Conducting a Levenshtein search with a threshold of 5 over the entire term space, for each term in the user's string, would be insanely expensive
  2. Even if #1 is what's being done, it still wouldn't explain the absence of erroneous suggestions

N-grams also don't seem to be in use: modifying a term so that it no longer contains any bigram present in the original term does not seem to affect the result(s).
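For reference, the n-gram technique being ruled out here looks roughly like this (the bigram_index, mapping each bigram to the indexed terms containing it, is an assumed structure):

    def bigrams(term):
        # character bigrams, e.g. bigrams("grey") == {"gr", "re", "ey"}
        return {term[i:i + 2] for i in range(len(term) - 1)}

    def bigram_candidates(term, bigram_index):
        # candidate terms are those sharing at least one bigram with the query term
        return set().union(*(bigram_index.get(bg, set()) for bg in bigrams(term)))

Under such a scheme, a query term sharing no bigrams with "grey" could never retrieve it, which is exactly the failure the modification test above should have triggered.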

Here's an example to illustrate my findings:

Example search string: "Fiftyyyy shades of grey"

Amazon suggestions: none 
(if the error count exceeds 1 on any term, the search fails)

Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)

Google suggestions: 10 (max) 
(breaking the search would require 5 or more errors on any single term, 
or multiple errors on multiple terms)

My question is: what type of sorcery is at work here? Are they just using a Levenshtein search with a huge error allowance, or are they using another technique I'm unaware of?

asked Sep 02 '12 by Kevin

1 Answer

Maybe you should try this approach: http://norvig.com/spell-correct.html
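In case the link goes stale: the core of Norvig's corrector is a word-frequency model plus brute-force generation of every string within one edit of the input (deletes, transposes, replaces, inserts), falling back to two edits. Condensed below; big.txt is the corpus file linked from his page:

    import re
    from collections import Counter

    # word frequencies learned from a corpus
    WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))

    def edits1(word):
        # every string one edit away from `word`
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    def correction(word):
        # prefer the word itself, then known words 1 edit away, then 2, then give up
        candidates = (known([word]) or known(edits1(word)) or
                      known(e2 for e1 in edits1(word) for e2 in edits1(e1)) or [word])
        return max(candidates, key=WORDS.get)

Generating candidates from the query term rather than scanning the whole term index sidesteps the cost concern in point #1 of the question, at least for small edit distances.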

answered Oct 13 '22 by f13o