I am referring to the algorithm used to suggest queries as a user types a search term in Google. I am mainly interested in:

1. Most important results (the most likely queries rather than anything that matches)
2. Substring matches
3. Fuzzy matches

I know you could use a trie or a generalized trie to find matches, but that alone wouldn't meet the above requirements.

Similar questions have been asked earlier here.


Algorithm for autocomplete?

2 Answers

For (heh) awesome fuzzy/partial string matching algorithms, check out Damn Cool Algorithms:

- http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
- http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
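
To give a feel for the BK-tree idea from the first link, here is a minimal sketch in Python. It is not taken from the linked articles: the class name `BKTree`, the helper `levenshtein`, and the sample words are all assumptions made for illustration. The tree stores each word under the edit distance to its parent, and the search uses the triangle inequality to skip subtrees that cannot contain a match.

```python
from typing import Dict, List, Optional, Tuple

# A node is (word, children); children are keyed by edit distance to the node's word.
Node = Tuple[str, Dict[int, "Node"]]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


class BKTree:
    """Hypothetical minimal BK-tree over Levenshtein distance."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> List[Tuple[int, str]]:
        results: List[Tuple[int, str]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: a match can only live under an edge
            # whose label lies in [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for w in ["apple", "apply", "ample", "maple", "banana"]:
    tree.add(w)
matches = tree.search("appel", 2)
```

With this word list, `matches` holds the entries within two edits of "appel", sorted by distance. Plain Levenshtein counts a transposition as two edits, so a production system might prefer a distance that treats swapped letters as one.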

These don't replace tries, but rather prevent brute-force lookups in tries - which is still a huge win. Next, you probably want a way to bound the size of the trie:

- keep a trie of the recent/top N words used globally;
- for each user, keep a trie of the recent/top N words for that user.
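
As a rough sketch of the bounded trie, the Python below keeps only the N most frequent queries from a log and returns completions ordered by frequency. The names `TrieNode`, `build_top_n_trie`, `suggest`, and the sample log are assumptions for illustration; a per-user trie would be built the same way from that user's own history.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # > 0 only on nodes that end a stored query


def build_top_n_trie(query_log: Iterable[str], n: int) -> TrieNode:
    """Insert only the n most frequent queries, which bounds the trie size."""
    root = TrieNode()
    for query, count in Counter(query_log).most_common(n):
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count = count
    return root


def suggest(root: TrieNode, prefix: str, limit: int = 5) -> List[str]:
    """Return stored queries starting with prefix, most frequent first."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    found: List[Tuple[int, str]] = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.count:
            found.append((-current.count, text))  # negate for descending order
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return [text for _, text in sorted(found)[:limit]]


# Hypothetical query log: "appendix" is too rare to make the top 4.
log = ["apple"] * 5 + ["banana"] * 4 + ["apply"] * 3 + ["application"] * 2 + ["appendix"]
trie = build_top_n_trie(log, n=4)
top = suggest(trie, "app")  # ["apple", "apply", "application"]
```

Walking the whole subtree on every lookup is fine for a small N. A larger trie could instead cache the top few completions on each node at build time.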

Finally, you want to prevent lookups whenever possible...

- cache lookup results: if the user clicks through on any search results, you can serve those very quickly and then asynchronously fetch the full partial/fuzzy lookup.
- precompute lookup results: if the user has typed "appl", they are likely to continue with "apple", "apply".
- prefetch data: for instance, a web app can send a smaller set of results to the browser, small enough to make brute-force searching in JS viable.
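
The caching and precomputing ideas above can be sketched in Python as follows. Everything here is an assumption for illustration: `RANKED_QUERIES` stands in for a real ranked index, `lookup` stands in for the expensive partial/fuzzy search, and `precompute` is a hypothetical helper that fills a prefix table ahead of time.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical vocabulary, already ordered from most to least likely.
RANKED_QUERIES = ("apple", "apply", "application", "banana")


@lru_cache(maxsize=10_000)
def lookup(prefix: str, limit: int = 3) -> Tuple[str, ...]:
    """Stand-in for the slow lookup; repeated prefixes are served from the cache."""
    return tuple(q for q in RANKED_QUERIES if q.startswith(prefix))[:limit]


def precompute(max_len: int = 4) -> Dict[str, Tuple[str, ...]]:
    """Build suggestions for every short prefix of every known query in advance."""
    table: Dict[str, Tuple[str, ...]] = {}
    for query in RANKED_QUERIES:
        for end in range(1, min(len(query), max_len) + 1):
            prefix = query[:end]
            if prefix not in table:
                table[prefix] = lookup(prefix)
    return table


table = precompute()
suggestions = table["appl"]  # ("apple", "apply", "application")
```

A table like this, restricted to short prefixes, is also small enough to ship to the browser for the prefetch case. The results are tuples so that cached values cannot be mutated by callers.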


answered Sep 19 '22 11:09

fearlesstost

I'd just like to say: a good solution to this problem is going to incorporate more than a ternary search tree. N-grams and shingles (phrases) are needed. Word-boundary errors also need to be detected: "hell o" should be "hello", and "whitesocks" should be "white socks". These are preprocessing steps. If you don't preprocess the data properly, you aren't going to get valuable search results. Ternary search trees are a useful component for figuring out what is a word, and also for implementing related-word guessing when a typed word isn't a valid word in the index.
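
A minimal sketch of the word-boundary step in Python, assuming the index can be queried as a plain set of known words. The function name `fix_word_boundaries` and the vocabulary are hypothetical, and a set stands in for the ternary search tree: only the membership test matters here.

```python
from typing import List, Set


def fix_word_boundaries(query: str, vocabulary: Set[str]) -> str:
    """Merge wrongly split tokens and split wrongly merged ones."""
    tokens = query.lower().split()
    fixed: List[str] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        # Merge case: "hell o" -> "hello". Only merge when the joined form is
        # a known word and at least one of the two parts is not.
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            both_known = token in vocabulary and nxt in vocabulary
            if token + nxt in vocabulary and not both_known:
                fixed.append(token + nxt)
                i += 2
                continue
        # Split case: "whitesocks" -> "white socks". Take the first cut point
        # where both halves are known words.
        if token not in vocabulary:
            for cut in range(1, len(token)):
                left, right = token[:cut], token[cut:]
                if left in vocabulary and right in vocabulary:
                    fixed.extend([left, right])
                    break
            else:
                fixed.append(token)  # no valid split, keep as typed
        else:
            fixed.append(token)
        i += 1
    return " ".join(fixed)


vocab = {"hello", "hell", "white", "socks", "world"}
merged = fix_word_boundaries("hell o", vocab)        # "hello"
split = fix_word_boundaries("whitesocks", vocab)     # "white socks"
```

This handles only a single merge or a single two-way split per token. Picking the first valid cut is a simplification; a real system would likely score candidate splits by word frequency.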

The Google algorithm performs phrase suggestion and correction. It also seems to have some concept of context. Compare "weatherforcst" with "monsoonfrcst" and "deskfrcst", where the first word is or isn't weather related: my guess is that, behind the scenes, the rankings in the suggestion are changed based on the first word encountered. "forecast" and "weather" are related words, so "forecast" gets a high rank in the Did-You-Mean guess.

In short, the components are: word partials (n-grams), phrase terms (shingles), word proximity (a word-clustering index), and a ternary search tree (word lookup).
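
To make the first two components concrete, here is a small Python sketch of character n-grams and word shingles, plus a Jaccard overlap score that could rank fuzzy candidates. The function names and the choice of Jaccard similarity are assumptions for illustration, not something the answer specifies.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Overlapping character n-grams (word partials) of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shingles(text: str, size: int = 2) -> List[str]:
    """Overlapping runs of `size` consecutive words (phrase terms)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


partials = char_ngrams("hello")            # ["hel", "ell", "llo"]
phrases = shingles("white socks on sale")  # ["white socks", "socks on", "on sale"]
closer = ngram_similarity("forcst", "forecast") > ngram_similarity("forcst", "desk")
```

Indexing words by their n-grams also allows substring lookups, since any substring of length n or more shares at least one n-gram with the word containing it.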

answered Sep 20 '22 11:09

Ben DeMott


Tags: algorithm, data-structures, autocomplete, scalability, autosuggest
