Lucene's algorithm

I read the paper by Doug Cutting, "Space optimizations for total ranking".

Since it was written a long time ago, I wonder which algorithms Lucene uses today for postings list traversal, score calculation, and ranking.

In particular, the total ranking algorithm described there traverses the entire postings list of each query term. For a query made of very common terms, such as "yellow dog", either of the two terms may have a very long postings list in a web-scale index. Are these lists really traversed in full in current Lucene/Solr, or are heuristics used to truncate them?

When only the top k results are returned, I can see that distributing the postings lists across multiple machines and then combining the top k from each would work. But if we have to return "the 100th result page", i.e. results ranked 990th to 1000th, each partition still has to compute its own top 1000, so partitioning would not help much.

Overall, is there any up-to-date detailed documentation on the internal algorithms used by Lucene?

asked Apr 25 '12 21:04

teddy

1 Answer

I am not aware of such documentation, but since Lucene is open source, I encourage you to read the source code. In particular, the current trunk version includes flexible indexing: storage and posting list traversal have been decoupled from the rest of the code, which makes it possible to write custom codecs.

Your assumptions about posting list traversal are correct. By default (it depends on your Scorer implementation), Lucene traverses the entire posting list of every term present in the query and puts matching documents in a heap of size k to compute the top-k docs (see TopDocsCollector). Returning results 990 to 1000 therefore makes Lucene instantiate a heap of size 1000. If you partition your index by document (another approach would be to split by term), every shard has to send its top 1000 results to the server responsible for merging them. See Solr's QueryComponent for example, which translates a query for results N to P (with P > N) into several shard requests for results 0 to P: sreq.params.set(CommonParams.START, "0");. This is why Solr can be slower in distributed mode than in standalone mode under extreme paging.
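To make the cost of deep paging concrete, here is a minimal sketch in plain Java. It is not Lucene's or Solr's actual code: the names Hit, topK and page are hypothetical, and each shard is modelled as an in-memory list of already-scored hits. It only illustrates the idea described above: a bounded heap of size start + rows per shard, followed by a merge on the coordinating node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only; Hit, topK and page are hypothetical names,
// not part of the Lucene or Solr APIs.
class TopKPagingSketch {

    /** A scored match; docId is only used to break score ties deterministically. */
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Best hit first: higher score wins, lower docId breaks ties.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
    };

    /** Scans every match once and keeps only the k best in a bounded heap. */
    static List<Hit> topK(Iterable<Hit> matches, int k) {
        // Reversed comparator: the head of the heap is the worst hit kept so far.
        PriorityQueue<Hit> heap = new PriorityQueue<>(Math.max(1, k), BEST_FIRST.reversed());
        for (Hit h : matches) {
            if (heap.size() < k) {
                heap.add(h);
            } else if (k > 0 && BEST_FIRST.compare(h, heap.peek()) < 0) {
                heap.poll(); // evict the current worst hit
                heap.add(h);
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(BEST_FIRST);
        return out;
    }

    /** Returns hits ranked [start, start + rows) across all shards. */
    static List<Hit> page(List<List<Hit>> shards, int start, int rows) {
        int k = start + rows; // every shard must produce this many candidates
        List<Hit> candidates = new ArrayList<>();
        for (List<Hit> shard : shards) {
            candidates.addAll(topK(shard, k)); // the "shard request from 0 to P"
        }
        List<Hit> global = topK(candidates, k);
        int from = Math.min(start, global.size());
        return new ArrayList<>(global.subList(from, global.size()));
    }

    public static void main(String[] args) {
        List<Hit> shardA = Arrays.asList(new Hit(0, 5f), new Hit(1, 1f), new Hit(2, 9f),
                new Hit(3, 3f), new Hit(4, 7f));
        List<Hit> shardB = Arrays.asList(new Hit(10, 8f), new Hit(11, 2f), new Hit(12, 6f),
                new Hit(13, 4f), new Hit(14, 10f));
        for (Hit h : page(Arrays.asList(shardA, shardB), 4, 2)) {
            System.out.println("doc " + h.docId + " score " + h.score);
        }
    }
}
```

Note that the heap size, and the number of candidates each shard ships to the merger, grows with start + rows rather than with rows alone, which is why page 100 is much more expensive than page 1 in this model.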

I don't know how Google manages to score results efficiently, but Twitter published a paper on their retrieval engine Earlybird where they explain how they patched Lucene in order to do efficient reverse chronological order traversal of the posting lists, which allows them to return the most recent tweets matching a query without traversing the entire posting list for every term.
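The early-termination idea can be sketched as follows. This is not Earlybird's implementation, only a toy illustration under two loudly stated assumptions: doc ids grow with time (a newer document has a larger id), and each posting list is stored as an int array sorted by descending doc id. The class and method names are hypothetical. Because the newest documents come first, a conjunctive query can stop as soon as it has k matches instead of reading both lists to the end.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not Earlybird or Lucene code.
class NewestFirstSketch {

    /**
     * Intersects two posting lists sorted by DESCENDING doc id
     * (assumption: larger id means more recent) and stops after k matches.
     */
    static List<Integer> newestMatches(int[] a, int[] b, int k) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length && hits.size() < k) {
            if (a[i] == b[j]) {
                hits.add(a[i]); // both terms occur in this document
                i++;
                j++;
            } else if (a[i] > b[j]) {
                i++; // a[i] is newer than anything left in b, so it cannot match
            } else {
                j++; // symmetric case
            }
        }
        return hits; // the rest of both lists is never read once k hits are found
    }

    public static void main(String[] args) {
        int[] yellow = {90, 80, 70, 50, 30, 10};
        int[] dog = {85, 80, 50, 40, 10};
        System.out.println(newestMatches(yellow, dog, 2));
    }
}
```

This shortcut only works because the desired ranking (recency) matches the storage order of the lists; for relevance-scored ranking the order of the postings says nothing about the score, so the same loop could not stop early.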

Update: I found this presentation from Googler Jeff Dean, which explains how Google built its large scale information retrieval system. In particular, it talks about sharding strategies and posting list encoding.

answered Oct 13 '22 11:10

jpountz

Tags: algorithm, indexing, lucene, information-retrieval, inverted-index