How does Lucene calculate the intersection of documents so fast?

Tags:

search

full-text-search

lucene

full-text-indexing

What are the internals of storage and search that allow this? As in, the nitty-gritty details?

For example, I have a million documents matched by one term and a million others matched by the second term of an AND query. How does Lucene do the intersection so fast when giving me the top k?

Does it store the documents in order of increasing doc ID for every term? And then, when two terms' documents have to be intersected, does it look for the first k common documents in both sets by iterating over them both incrementally, in a single pass?

Or does it build a simple unordered hash set from the larger document array to find the common documents?

Or are both (or possibly more) of these intersection policies used, depending on the number of documents requested by the user, the number matched by the individual terms, and other factors?

Any articles that point out the nitty-gritty of the document array merge would be appreciated.

Edit: Thanks for the info, guys. It makes sense now. Skip lists do the magic. I will dig into it more to gain a clear understanding.

418

asked Oct 07 '11 23:10

Guy Sensei

3 Answers

1. The index stores each term's documents sorted by doc ID. When you query with the AND operator (term1 AND term2), Lucene uses two iterators, so once you know that term1's list starts at docN, you can skip every document for term2 before docN. The iterators therefore have not only a next method but also a very efficient skipTo method, which is implemented with a skip list index (http://en.wikipedia.org/wiki/Skip_list). Using next and skipTo, the search jumps over large chunks of each list very quickly, and because the data is sparse this is very efficient (the same approach would not work for a usual database, for example).
2. The other point is that Lucene holds only the N best documents, which is much faster than sorting all scored documents. Requesting the 10 best is cheaper than requesting the 20 best.
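To make the two points above concrete, here is a minimal Python sketch. It is not Lucene's code: the names `PostingIterator`, `skip_to`, `intersect` and `top_k` are made up for illustration, and a binary search over a sorted list stands in for the skip list (both let you jump to the first doc ID at or after a target without visiting every entry). The top-N part assumes a fixed-size min-heap.

```python
import heapq
from bisect import bisect_left


class PostingIterator:
    """Iterator over one term's doc IDs, kept sorted in increasing order."""

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids  # assumed sorted and duplicate-free
        self.pos = 0

    def current(self):
        # None signals that the list is exhausted.
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def skip_to(self, target):
        # Jump to the first doc ID >= target. Binary search is a stand-in
        # for the skip-list lookup; it never moves backwards.
        self.pos = bisect_left(self.doc_ids, target, self.pos)
        return self.current()


def intersect(a, b):
    """Yield doc IDs present in both iterators, in increasing order."""
    doc_a, doc_b = a.current(), b.current()
    while doc_a is not None and doc_b is not None:
        if doc_a == doc_b:
            yield doc_a
            doc_a = a.skip_to(doc_a + 1)
            doc_b = b.skip_to(doc_b + 1)
        elif doc_a < doc_b:
            # a is behind: leap straight to b's current doc ID.
            doc_a = a.skip_to(doc_b)
        else:
            doc_b = b.skip_to(doc_a)


def top_k(scored_docs, k):
    """Keep only the k best (score, doc_id) pairs using a size-k min-heap."""
    heap = []
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # Better than the worst kept hit: replace it.
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)


term1 = PostingIterator([1, 4, 7, 9, 20, 31])
term2 = PostingIterator([2, 4, 9, 10, 31, 50])
common = list(intersect(term1, term2))
best = top_k([(0.3, 4), (0.9, 9), (0.5, 31)], 2)
```

With this approach the heap never grows past k entries, so collecting the top k costs roughly O(n log k) instead of the O(n log n) of sorting every scored document.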

136

answered Sep 21 '22 12:09

3 Answers

yura

A. Coady

Xodarap
