How to implement autocomplete on a massive dataset

Tags:

I'm trying to implement something like Google suggest on a website I am building and am curious how to go about doing in on a very large dataset. Sure if you've got 1000 items you cache the items and just loop through them. But how do you go about it when you have a million items? Further, suppose that the items are not one word. Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand" but it also brings back Toad The Wet Sprocket. And their autocomplete is FAST. My first idea was to group the items by the first two letters, so you would have something like:

Dictionary<string,List<string>>

where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea: Wet would never match Toad the Wet Sprocket because it would be in the "TO" bucket instead of the "WE" bucket. So then perhaps you could split the string up and "Toad the Wet Sprocket" go in the "TO", "WE" and "SP" buckets (strip out the word "THE"), but when you're talking about a million entries which may have to say a few words each possibly, that seems like you'd quickly start using up a lot of memory. Ok, that was a long question. Thoughts?

998

asked Mar 24 '09 19:03

aquinas

1 Answers

As I pointed out in How to implement incremental search on a list you should use structures like a Trie or Patricia trie for searching patterns in large texts.

And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.

When I insert some new text into the Trie, I just insert it, then remove the first character, insert again, remove the second character, insert again ... and so on until the whole text is consumed. Then you can discover every substring of every inserted text by just one search from the root. That resulting structure is called a Suffix Tree and there are a lot of optimizations available.

And it is really incredible fast. To find all texts that contain a given sequence of n characters you have to inspect at most n nodes and perform a search on the list of children for every node. Depending on the implementation (array, list, binary tree, skip list) of the child node collection, you might be able to identify the required child node with as few as 5 search steps assuming case insensitive latin letters only. Interpolation sort might be helpful for large alphabets and nodes with many children as those usually found near the root.

answered Sep 22 '22 20:09

Daniel Brückner

Related questions
                            
                                Python vectorizing nested for loops
                            
                                Difference between mt_rand() and rand()
                            
                                Fast tensor rotation with NumPy
                            
                                How to get second-highest salary employees in a table
                            
                                Efficient way to check if std::string has only spaces
                            
                                Why does GCC implement isnan() more efficiently for C++ <cmath> than C <math.h>?
                            
                                Optimizing native hit testing of DOM elements (Chrome)
                            
                                Comparing two values in the form (a + sqrt(b)) as fast as possible?
                            
                                Why defining class as final improves JVM performance?
                            
                                Expressions "j = ++(i | i); and j = ++(i & i); should be a lvalue error?
                            
                                std::vector reserve() and push_back() is faster than resize() and array index, why?
                            
                                Dereferencing type-punned pointer will break strict-aliasing rules
                            
                                How do you view a sample of the call stack in ruby?
                            
                                sorted() using generator expressions rather than lists
                            
                                for loop optimization
                            
                                Java Preprocessor
                            
                                Speeding Up Python
                            
                                One could use a profiler, but why not just halt the program? [closed]
                            
                                Genetic Algorithms and multi-objectives optimization on PYTHON : libraries/tools to use? [closed]
                            
                                How to minify JS or CSS on the fly [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to implement autocomplete on a massive dataset

Tags:

optimization

autocomplete

aquinas

People also ask

1 Answers

Daniel Brückner

Recent Activity

Donate For Us