What is the algorithm to search an index for multiple values?

Tags: language-agnostic, algorithm, indexing, search

This is actually a real problem I'm working on, but for simplicity, let's pretend I'm Google.

Say the user searches for "nanoscale tupperware". There aren't very many pages with both words... only about 3k. But there are ~2 million pages with "nanoscale" and ~4 million with "tupperware". Still, Google finds the 3k for me in 0.3 seconds.

How does it do it?

The only algorithm I'm aware of is to get the documents for "nanoscale", get the documents for "tupperware", and then do a list merge. But that's O(N + M), which here means roughly 6 million steps, and that seems a little slow, particularly if I'm running it on a desktop instead of an uber-fast cluster.

So is that actually what Google is doing, and their speed is due mostly to the fact that they're running this expensive computation on their massive distributed cluster?

Or is there a better algorithm that I'm not aware of? Wikipedia and Google aren't turning up anything for me.

Edit:

Since people seem to be focusing on the Google aspect of my question, I guess I'll restate it in the actual terms.

I have several very large (millions of items) indexes implemented as key/value pairs. Keys are simple words, values are Sets of documents. A common use case is to get the intersection of results on several searches on different indexes: the pain point is getting the intersection of the document sets.

I can re-implement my indexes however I want - it's mostly an academic project at this point.
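For concreteness, the setup and the baseline list merge described above can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes integer document ids and whitespace tokenization, stores each value as a sorted list rather than a Set so that a merge is possible, and the names `build_index` and `merge_intersect` are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def merge_intersect(a, b):
    """Intersect two sorted lists by walking both; O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it by one position
        else:
            j += 1  # b is behind, advance it by one position
    return out


docs = {
    1: "nanoscale tupperware review",
    2: "tupperware party",
    3: "nanoscale materials",
    4: "Nanoscale Tupperware coatings",
}
index = build_index(docs)
print(merge_intersect(index["nanoscale"], index["tupperware"]))  # [1, 4]
```

Every non-matching step moves exactly one pointer by one position, which is where the O(N + M) cost comes from.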

369

asked Feb 22 '10 19:02

levand

1 Answer

The way you're describing it, you already have an inverted index, with a posting list for each term (the list of documents). I'm not aware of a better solution than merge joining the posting lists for each term, and to the best of my knowledge, that's what fulltext indexing solutions like Lucene do. There are a couple of obvious optimisations you can make here, though:

If you can store your dataset in memory, even distributed across many machines, you can merge join the result sets very quickly indeed, since you avoid the cost of disk seeks.
The 'naive' merge join algorithm advances one pointer by one position on each non-match. If your posting lists are themselves indexed, you can do a lot better: take the maximum of the individual current values, then seek in all the other posting lists to the first value greater than or equal to that key, possibly skipping millions of irrelevant results in the process. This has been called a zig-zag merge join.
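A rough sketch of the zig-zag idea, under the assumption that each posting list is a sorted, duplicate-free Python list held in memory. Binary search via the standard `bisect_left` stands in for "indexed posting lists" here; a real index might use skip pointers or a B-tree instead. The function name `zigzag_intersect` is made up for illustration.

```python
from bisect import bisect_left


def zigzag_intersect(posting_lists):
    """Intersect any number of sorted, duplicate-free posting lists."""
    if not posting_lists or any(not plist for plist in posting_lists):
        return []
    positions = [0] * len(posting_lists)
    out = []
    while True:
        # Candidate key: the largest value currently under any cursor.
        target = max(plist[i] for plist, i in zip(posting_lists, positions))
        matched = True
        for k, plist in enumerate(posting_lists):
            # Seek to the first value >= target, never moving backwards.
            pos = bisect_left(plist, target, positions[k])
            if pos == len(plist):
                return out  # one list is exhausted, no further matches
            positions[k] = pos
            if plist[pos] != target:
                matched = False  # this list skipped past the candidate
        if matched:
            out.append(target)
            # Step every cursor past the match before looking again.
            for k in range(len(positions)):
                positions[k] += 1
                if positions[k] == len(posting_lists[k]):
                    return out


sparse = [5, 1_000_000]
dense = list(range(1_000_001))
print(zigzag_intersect([sparse, dense]))  # [5, 1000000]
```

In the usage example the dense list is never walked element by element: each seek is a binary search, so the work grows roughly with the length of the shortest list times the logarithm of the longer ones, rather than with the sum of all list lengths.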

187

answered Oct 19 '22 05:10

Nick Johnson
