I read Doug Cutting's paper, "Space optimizations for total ranking".
Since it was written a long time ago, I wonder what algorithms Lucene uses today for postings list traversal, score calculation, and ranking.
In particular, the total ranking algorithm described there traverses the entire postings list for each query term. For a query made of very common terms, such as "yellow dog", either term may have an extremely long postings list in a web search setting. Are these lists really traversed in full in current Lucene/Solr, or are heuristics employed to truncate them?
When only the top k results are returned, I can understand that distributing the postings lists across multiple machines and then combining the top k from each would work. But if we are required to return "the 100th result page", i.e. results ranked 990th to 1000th, then each partition would still have to find its own top 1000, so partitioning would not help much.
Overall, is there any up-to-date detailed documentation on the internal algorithms used by Lucene?
Apache Lucene is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-neighbor search across high-dimensionality vectors, spell correction or query suggestions. It is an open source project available for free download.
Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, data sources structure data as objects or records, which in turn have fields and values.
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
A Lucene index is an inverted index. An index may store a heterogeneous set of documents, with any number of different fields that may vary from document to document in arbitrary ways. Lucene indexes terms, which means that a Lucene search is a search over terms. A term combines a field name with a token.
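As a rough illustration of the idea of terms and postings, here is a minimal in-memory sketch. It is not how Lucene stores its index; the class InvertedIndexSketch, its methods, and the "field:token" key format are hypothetical names chosen for this example, and the tokenizer is a naive lowercase split.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an inverted index; not Lucene's actual data structures.
class InvertedIndexSketch {
    // Maps a term (field name + ":" + token) to the ids of the documents containing it.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Indexes one field of one document.
    // Assumption: documents are added in increasing docId order, so each list stays sorted.
    void add(int docId, String field, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list =
                postings.computeIfAbsent(field + ":" + token, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the postings list of a term, or an empty list if the term is unknown.
    List<Integer> lookup(String field, String token) {
        return postings.getOrDefault(field + ":" + token, Collections.emptyList());
    }
}
```

Looking up a term is a single map access followed by a walk over the ids in its list, which is why the length of that list matters so much for very common terms.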
I am not aware of such documentation, but since Lucene is open source, I encourage you to read the source code. In particular, the current trunk version includes flexible indexing, meaning that storage and posting list traversal have been decoupled from the rest of the code, making it possible to write custom codecs.
Your assumptions about posting list traversal are correct. By default (it depends on your Scorer implementation), Lucene traverses the entire posting list of every term present in the query and puts matching documents in a heap of size k to compute the top-k docs (see TopDocsCollector). So returning results 990 to 1000 makes Lucene instantiate a heap of size 1000. And if you partition your index by document (another approach would be to split by term), every shard needs to send its top 1000 results to the server responsible for merging them. See Solr's QueryComponent for example, which translates a query for results N to P (P > N) into several shard requests for results 0 to P: sreq.params.set(CommonParams.START, "0");
This is why Solr might be slower in distributed mode than in standalone mode in case of extreme paging.
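To make the cost of deep paging concrete, here is a minimal sketch of the heap-based top-k collection and the shard merge described above. It only illustrates the idea and is not Lucene or Solr code: DeepPagingSketch, Hit, topK and page are hypothetical names, and each shard is modelled as a plain list of already scored hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names are Lucene or Solr APIs.
class DeepPagingSketch {
    // A scored document (hypothetical type).
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Best hit first: higher score wins, ties are broken by lower docId.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
    };

    // Visits every match once and keeps only the k best in a bounded heap.
    static List<Hit> topK(Iterable<Hit> matches, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // The head of the heap is the worst hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(k, BEST_FIRST.reversed());
        for (Hit h : matches) {
            if (heap.size() < k) {
                heap.add(h);
            } else if (BEST_FIRST.compare(h, heap.peek()) < 0) {
                heap.poll();   // evict the current worst hit
                heap.add(h);
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(BEST_FIRST);
        return out;
    }

    // Returns hits ranked start .. start+rows-1 over a document-partitioned index.
    // Every shard must contribute its own top (start + rows) hits, because any
    // of them could end up on the requested page after the merge.
    static List<Hit> page(List<List<Hit>> shards, int start, int rows) {
        int k = start + rows;
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shard : shards) {
            merged.addAll(topK(shard, k));
        }
        List<Hit> global = topK(merged, k);
        return global.subList(Math.min(start, global.size()), global.size());
    }
}
```

With start = 990 and rows = 10, each shard builds a heap of 1000 entries and ships up to 1000 hits to the merging node, even though only 10 are finally returned.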
I don't know how Google manages to score results efficiently, but Twitter published a paper on their retrieval engine Earlybird where they explain how they patched Lucene in order to do efficient reverse chronological order traversal of the posting lists, which allows them to return the most recent tweets matching a query without traversing the entire posting list for every term.
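The general idea of stopping early can be sketched as follows. This is not Earlybird's actual implementation, only an illustration under two assumptions: larger document ids mean more recent documents, and each posting list is stored in descending id order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of early termination; not Twitter's or Lucene's code.
class NewestFirstSketch {
    // Intersects two posting lists sorted by DESCENDING doc id (newest first)
    // and stops as soon as k matching documents have been found.
    static List<Integer> newestMatches(int[] a, int[] b, int k) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length && out.size() < k) {
            if (a[i] == b[j]) {
                out.add(a[i]);   // document contains both terms
                i++;
                j++;
            } else if (a[i] > b[j]) {
                i++;             // a[i] is newer than anything left in b
            } else {
                j++;             // b[j] is newer than anything left in a
            }
        }
        return out;
    }
}
```

Because the result order is the traversal order, the first k matches are already the final answer, so the tails of both lists are never read. This shortcut does not carry over directly to relevance ranking, where a high-scoring document may sit anywhere in the list.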
Update: I found this presentation from Googler Jeff Dean, which explains how Google built its large scale information retrieval system. In particular, it talks about sharding strategies and posting list encoding.