I am referring to the algorithm that is used to give query suggestions when a user types a search term in Google.
I am mainly interested in:
1. Most important results (most likely queries rather than anything that matches)
2. Match substrings
3. Fuzzy matches
I know you could use a trie or generalized trie to find matches, but it wouldn't meet the above requirements...
Similar questions asked earlier here
Autocomplete is the feature of suggesting possible completions for partially typed text, and it is widely used in search engines, code IDEs, and more. The trie data structure is a good fit for implementing this feature efficiently in terms of memory and time: finding the node for a prefix takes O(length of string).
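To make the trie idea concrete, here is a minimal sketch of prefix lookup that also ranks completions by how often each query was seen, which touches on the "most likely queries" requirement from the question. The names `Trie`, `insert`, and `suggest`, and the idea of storing a raw count per query, are assumptions made for illustration; this is not how Google does it.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # > 0 only where a complete query ends


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, count=1):
        """Add a query, or bump its popularity if already present."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def suggest(self, prefix, k=3):
        """Return up to k stored queries starting with prefix, most popular first."""
        node = self.root
        for ch in prefix:  # O(len(prefix)) walk to the prefix node
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:  # collect every completion below the prefix node
            cur, text = stack.pop()
            if cur.count:
                found.append((cur.count, text))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        # Highest count first; ties broken alphabetically for stable output.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:k]]


trie = Trie()
for q, c in [("weather", 50), ("weather forecast", 80), ("web", 10), ("wedding", 30)]:
    trie.insert(q, c)
print(trie.suggest("we", k=2))
```

Note that only the walk to the prefix node is O(length of string); collecting and sorting the completions costs time proportional to the subtree size. A real system would probably cache the top-k completions at each node instead.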
Our automated systems generate predictions that help people save time by allowing them to quickly complete the search they already intended to do. Autocomplete predictions reflect real searches that have been done on Google.
From the control panel, select the search engine you want to edit. Click Search features from the menu on the left and then click the Autocomplete tab. Click on the slider to set Enable autocomplete to On.
For (heh) awesome fuzzy/partial string matching algorithms, check out Damn Cool Algorithms:
These don't replace tries, but rather prevent brute-force lookups in tries - which is still a huge win. Next, you probably want a way to bound the size of the trie:
Finally, you want to prevent lookups whenever possible...
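As a rough illustration of fuzzy matching that avoids a brute-force comparison against every stored word, one common approach is to walk the trie while carrying a single row of the Levenshtein table, and to abandon any branch whose row already exceeds the allowed distance. This is a sketch under my own assumptions (the dict-based trie, the `END` marker, and the function names are made up for the example), and it is not necessarily the technique from the posts referenced above.

```python
END = ""  # end-of-word marker; never collides with a real one-character key


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word  # store the full word at its terminal node
    return root


def fuzzy_search(root, target, max_dist):
    """Return (word, distance) pairs within max_dist edits of target."""
    results = []

    def walk(node, ch, prev_row):
        # Build the next Levenshtein row for the trie prefix extended by ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(target) + 1):
            row.append(min(
                row[i - 1] + 1,                            # skip a target char
                prev_row[i] + 1,                           # skip the trie char
                prev_row[i - 1] + (target[i - 1] != ch),   # match or substitute
            ))
        if END in node and row[-1] <= max_dist:
            results.append((node[END], row[-1]))
        # Prune: if every cell exceeds max_dist, no descendant can match.
        if min(row) <= max_dist:
            for next_ch, child in node.items():
                if next_ch != END:
                    walk(child, next_ch, row)

    first_row = list(range(len(target) + 1))
    for ch, child in root.items():
        if ch != END:
            walk(child, ch, first_row)
    return sorted(results, key=lambda r: (r[1], r[0]))


root = build_trie(["forecast", "forest", "weather", "hello"])
print(fuzzy_search(root, "forcast", 1))
```

Because words sharing a prefix share the rows computed for that prefix, the work is done once per trie node rather than once per word.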
I'd just like to say... A good solution to this problem is going to incorporate more than a ternary search tree. N-grams and shingles (phrases) are needed. Word-boundary errors also need to be detected: "hell o" should be "hello", and "whitesocks" should be "white socks". These are preprocessing steps. If you don't preprocess the data properly, you aren't going to get valuable search results. Ternary search trees are a useful component for figuring out what is a word, and also for implementing related-word guessing when a typed word isn't a valid word in the index.
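The two word-boundary cases mentioned above could be handled with something like the following. It is only a sketch: it assumes a plain Python set as the vocabulary (a ternary search tree would play the same role), and the function names are hypothetical. It handles a single wrong split or a single missing space per token, nothing more.

```python
def split_run_together(token, vocab):
    """'whitesocks' -> ['white', 'socks'] if both halves are known words."""
    if token in vocab:
        return [token]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [token]  # no valid split found; leave it for the spell checker


def join_split(tokens, vocab):
    """['hell', 'o'] -> ['hello'] when the pieces only make sense joined."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Join only if the result is a word and at least one piece is not.
            if a + b in vocab and (a not in vocab or b not in vocab):
                out.append(a + b)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


def fix_word_boundaries(text, vocab):
    tokens = join_split(text.lower().split(), vocab)
    fixed = []
    for token in tokens:
        fixed.extend(split_run_together(token, vocab))
    return " ".join(fixed)


vocab = {"hello", "hell", "white", "socks"}
print(fix_word_boundaries("hell o", vocab))
print(fix_word_boundaries("whitesocks", vocab))
```

With real data you would want to score candidate splits by word frequency rather than taking the first one that works.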
The Google algorithm performs phrase suggestion and correction. It also seems to have some concept of context: if the first word you type is weather-related and you run it together with a misspelled second word, as in "weatherforcst" vs. "monsoonfrcst" vs. "deskfrcst", my guess is that the suggestion rankings are adjusted behind the scenes based on the first word encountered. "Forecast" and "weather" are related words, so "forecast" gets a high rank in the Did-You-Mean guess.
word-partials (ngrams), phrase-terms (shingles), word-proximity (word-clustering-index), ternary-search-tree (word lookup).
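For the first two items, here is a small sketch of what word-partials and phrase-terms look like in code, plus a character n-gram index used for substring matching (one of the requirements in the question). The function names and the choice of trigrams are assumptions for the example.

```python
from collections import defaultdict


def char_ngrams(word, n=3):
    """Word-partials: every run of n consecutive characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shingles(text, n=2):
    """Phrase-terms: every run of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_ngram_index(queries, n=3):
    """Map each character n-gram to the set of queries containing it."""
    index = defaultdict(set)
    for query in queries:
        for gram in char_ngrams(query, n):
            index[gram].add(query)
    return index


def substring_search(index, fragment, n=3):
    """Find queries containing fragment, using the index to narrow candidates."""
    grams = char_ngrams(fragment, n)
    if not grams:
        return []  # fragment shorter than n; this sketch has no fallback
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams does not guarantee adjacency, so verify each one.
    return sorted(q for q in candidates if fragment in q)


print(char_ngrams("hello"))
print(shingles("new york weather"))
index = build_ngram_index(["weather forecast", "white socks", "forecast today"])
print(substring_search(index, "cast"))
```

The n must match between `build_ngram_index` and `substring_search`, otherwise the lookups will miss.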