This is actually a real problem I'm working on, but for simplicity, let's pretend I'm Google.
Say the user searches for "nanoscale tupperware". There aren't very many pages with both words... only about 3k. But there are ~2 million pages with "nanoscale" and ~4 million with "tupperware". Still, Google finds the 3k for me in 0.3 seconds.
How does it do it?
The only algorithm I'm aware of is to get the documents for "nanoscale", get the documents for "tupperware", and then do a list merge. But that's O(N + M) — about 6,000,000 comparisons here — which seems a little slow. Particularly if I'm running it on a desktop instead of an uber-fast cluster.
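For concreteness, here's a minimal sketch of that list-merge intersection over two sorted posting lists of document IDs (function name is mine, not from any particular library):

```python
def intersect_merge(a, b):
    """Linear-time merge intersection of two sorted doc-ID lists: O(N + M)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:          # doc appears in both lists
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:         # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return out
```

Each pointer only ever moves forward, so the cost is proportional to the combined length of the two lists, regardless of how small the intersection is.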
So is that actually what Google is doing, and their speed is due mostly to the fact that they're running this expensive computation on their massive distributed cluster?
Or is there a better algorithm that I'm not aware of? Wikipedia and Google aren't turning up anything for me.
Edit:
Since people seem to be focusing on the Google aspect of my question, I guess I'll restate it in the actual terms.
I have several very large (millions of items) indexes implemented as key/value pairs. Keys are simple words, values are Sets of documents. A common use case is to get the intersection of results on several searches on different indexes: the pain point is getting the intersection of the document sets.
I can re-implement my indexes however I want - it's mostly an academic project at this point.
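Given the key/value layout described above (words mapping to sets of documents), one cheap win when intersecting several terms is to start from the smallest set, so the running result can only shrink. A sketch, assuming a plain dict-of-sets index (the `index` structure and function name are illustrative, not from the question):

```python
def intersect_indexes(index, terms):
    """Intersect the posting sets for several terms, smallest set first,
    bailing out early if the running result becomes empty."""
    postings = sorted((index[t] for t in terms), key=len)
    result = set(postings[0])
    for s in postings[1:]:
        result &= s
        if not result:            # no documents can survive further intersection
            break
    return result
```

With a rare term like "nanoscale tupperware", the first intersection already cuts the candidate set down to a few thousand documents, and every later step only has to probe that small set.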
The way you're describing it, you already have an inverted index, with a posting list for each term (the list of documents containing it). I'm not aware of a better general solution than merge-joining the posting lists for each term, and to the best of my knowledge, that's what fulltext indexing solutions like Lucene do. There are a couple of obvious optimisations you can make here, though:
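One such optimisation (my suggestion, not necessarily what Lucene does internally) exploits the skew in your list sizes: when one posting list is much shorter than the other, iterate the short one and locate each of its doc IDs in the long one with a galloping (exponential) search instead of scanning linearly. That is roughly O(n log m) for list lengths n ≪ m, which beats O(n + m) when n is tiny:

```python
from bisect import bisect_left

def gallop_intersect(small, large):
    """Intersect two sorted doc-ID lists by galloping search in the larger one.
    Roughly O(n log m) for len(small)=n << len(large)=m."""
    out = []
    lo = 0                              # search never moves backwards
    for x in small:
        # Gallop: double the step until we bracket x (or run off the end).
        step, hi = 1, lo
        while hi < len(large) and large[hi] < x:
            lo = hi
            hi += step
            step *= 2
        # Binary search within the bracketed window [lo, hi].
        pos = bisect_left(large, x, lo, min(hi + 1, len(large)))
        if pos < len(large) and large[pos] == x:
            out.append(x)
        lo = pos                        # resume from here for the next x
    return out
```

For the question's numbers — a 2-million-entry list against a 4-million-entry one — plain merging is still linear in both, but if you intersect the 3k-document running result against the next term's list, galloping makes each subsequent term nearly free.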