Algorithm for autocomplete?

I am referring to the algorithm that is used to give query suggestions when a user types a search term in Google.

I am mainly interested in:

1. Most important results (most likely queries rather than anything that matches)
2. Match substrings
3. Fuzzy matches

I know you could use a trie or a generalized trie to find matches, but that wouldn't meet the above requirements...

Similar questions asked earlier here

StackUnderflow asked May 25 '10 03:05




2 Answers

For (heh) awesome fuzzy/partial string matching algorithms, check out Damn Cool Algorithms:

  • http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
  • http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
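To give a feel for the first link, here is a minimal BK-tree sketch in Python. It is not taken from the linked article; the class name `BKTree` and the helper `levenshtein` are made up for illustration, and it assumes plain Levenshtein distance as the metric.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree: each node is (word, {distance: child})."""

    def __init__(self) -> None:
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> List[Tuple[int, str]]:
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: a match can only live in a subtree whose
            # edge label lies within [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for w in ["apple", "apply", "ample", "maple", "banana"]:
    tree.add(w)
print(tree.search("appl", 1))
```

The pruning step is the point: most subtrees are skipped without computing a distance for every stored word.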

These don't replace tries, but rather prevent brute-force lookups in tries - which is still a huge win. Next, you probably want a way to bound the size of the trie:

  • keep a trie of recent/top N words used globally;
  • for each user, keep a trie of recent/top N words for that user.
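A rough sketch of the "top N" idea, assuming you have a log of past queries as a plain list of strings. The names `TrieNode`, `build_top_n_trie` and `complete` are hypothetical; a per-user trie would be built the same way from that user's own log.

```python
from collections import Counter
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # > 0 only where a stored query ends


def build_top_n_trie(query_log: List[str], n: int) -> TrieNode:
    """Keep only the n most frequent queries, so the trie stays bounded."""
    root = TrieNode()
    for query, count in Counter(query_log).most_common(n):
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count
    return root


def complete(root: TrieNode, prefix: str, limit: int = 5) -> List[str]:
    """Return stored queries starting with prefix, most frequent first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.count:
            found.append((-current.count, text))
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return [text for _, text in sorted(found)[:limit]]


log = ["apple"] * 3 + ["apply"] * 2 + ["apricot"] + ["banana"] * 5
trie = build_top_n_trie(log, n=3)
print(complete(trie, "ap"))
```

Ranking by frequency is what turns "anything that matches" into "most likely queries"; rare queries such as "apricot" here simply never make it into the trie.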

Finally, you want to prevent lookups whenever possible...

  • cache lookup results: if the user clicks through on any search results, you can serve those very quickly and then asynchronously fetch the full partial/fuzzy lookup.
  • precompute lookup results: if the user has typed "appl", they are likely to continue with "apple", "apply".
  • prefetch data: for instance, a web app can send a smaller set of results to the browser, small enough to make brute-force searching in JS viable.
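The precompute idea can be sketched as a plain prefix-to-suggestions table built offline. This is only an illustration under the assumption that a query log is available as a list of strings; `precompute_suggestions` is a made-up name. A table like this is also small enough to ship to the browser for a limited vocabulary.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def precompute_suggestions(query_log: List[str], per_prefix: int = 3) -> Dict[str, List[str]]:
    """Map every prefix to its top few completions, ranked by frequency."""
    counts = Counter(query_log)
    table = defaultdict(list)
    # Visit queries from most to least frequent so each list is already ranked.
    for query, _ in counts.most_common():
        for end in range(1, len(query) + 1):
            bucket = table[query[:end]]
            if len(bucket) < per_prefix:
                bucket.append(query)
    return dict(table)


log = ["apple"] * 3 + ["apply"] * 2 + ["applet"]
table = precompute_suggestions(log)
print(table["appl"])          # answering a keystroke is one dictionary lookup
print(table.get("xyz", []))   # unknown prefix: fall back to the slower fuzzy path
```

The trade-off is memory for latency: every keystroke becomes a single lookup, and the expensive partial/fuzzy search only runs when the table has nothing.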
fearlesstost answered Sep 19 '22 11:09



I'd just like to say... a good solution to this problem is going to incorporate more than a ternary search tree. N-grams and shingles (phrases) are needed. Word-boundary errors also need to be detected: "hell o" should be "hello", and "whitesocks" should be "white socks". These are pre-processing steps. If you don't preprocess the data properly, you aren't going to get valuable search results. Ternary search trees are a useful component in figuring out what is a word, and also for implementing related-word guessing when a typed word isn't a valid word in the index.
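The word-boundary step could look roughly like this. It is a sketch, not what any search engine actually does: it assumes the known words fit in a Python set (a ternary search tree would play that role in practice), and the function names are invented for the example.

```python
from typing import List, Optional, Set


def split_run_together(token: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split e.g. 'whitesocks' into known words, or return None if impossible."""
    # best[i] holds the shortest known segmentation of token[:i], if any.
    best: List[Optional[List[str]]] = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocab:
                candidate = best[start] + [token[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]


def join_split_words(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Merge neighbours like 'hell' + 'o' when the joined form is a known word."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = tokens[i] + tokens[i + 1]
            both_known = tokens[i] in vocab and tokens[i + 1] in vocab
            if pair in vocab and not both_known:
                out.append(pair)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


def normalize(query: str, vocab: Set[str]) -> str:
    tokens = join_split_words(query.lower().split(), vocab)
    fixed: List[str] = []
    for token in tokens:
        parts = None if token in vocab else split_run_together(token, vocab)
        fixed.extend(parts if parts else [token])
    return " ".join(fixed)


vocab = {"hello", "hell", "white", "socks", "world"}
print(normalize("hell o world", vocab))
print(normalize("whitesocks", vocab))
```

Tokens that cannot be repaired are passed through unchanged, so a later fuzzy-matching stage still gets a chance at them.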

The Google algorithm performs phrase suggestion and correction. It also has some concept of context. Suppose the first word you search for is weather-related and you run the words together: compare "weatherforcst" vs "monsoonfrcst" vs "deskfrcst". My guess is that, behind the scenes, the rankings in the suggestions are being changed based on the first word encountered. "Forecast" and "weather" are related words, therefore "forecast" gets a high rank in the Did-You-Mean guess.

To summarize the components: word partials (n-grams), phrase terms (shingles), word proximity (word-clustering index), and a ternary search tree (word lookup).
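For the first two of those components, a small sketch: character n-grams, word shingles, and an n-gram index used to find phrases containing a typed fragment anywhere (not just as a prefix). The function names and the choice of trigrams are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Word partials: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shingles(phrase: str, size: int = 2) -> List[str]:
    """Phrase terms: every run of `size` consecutive words."""
    words = phrase.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_ngram_index(phrases: List[str], n: int = 3) -> Dict[str, Set[str]]:
    index: Dict[str, Set[str]] = defaultdict(set)
    for phrase in phrases:
        for gram in char_ngrams(phrase, n):
            index[gram].add(phrase)
    return index


def substring_candidates(index: Dict[str, Set[str]], fragment: str, n: int = 3) -> Set[str]:
    """Phrases containing fragment; needs at least n characters to use the index."""
    grams = char_ngrams(fragment, n)
    if not grams:
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams does not guarantee a contiguous match, so verify.
    return {p for p in candidates if fragment in p}


index = build_ngram_index(["weather forecast", "white socks", "sock puppet"])
print(char_ngrams("hello"))
print(shingles("weather forecast today"))
print(substring_candidates(index, "sock"))
```

The same index can be reused for fuzzy matching by ranking phrases on how many n-grams they share with the input instead of requiring all of them.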

Ben DeMott answered Sep 20 '22 11:09
