Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how does lucene calculate intersection of documents so fast?

What are the internals of storage and search that allow this? As in the nitty gritties?

For example, I have a million documents matched by a term and a million others matched by a second term of an AND query. How does lucene do an intersection so fast in giving me top k?

Does it store the document in order of increasing doc IDS for every term? And then when two terms' documents have to be intersected, it looks for the first common k documents in both sets by iterating over them both incrementally, in a single pass.

Or, does it use simple unordered hash set from the larger documents array to find the common documents?

Or are both such(or possibly more) types of intersection polices used depending on the number of documents asked by user, those matched by individual terms etc among other factors?

Any articles which could point out the nitty gritty of document array merge will be appreciated.

Edit: Thanks for the info guys. It makes sense now. Skip lists do the magic. I will dig more into it to gain clear understanding.

like image 418
Guy Sensei Avatar asked Oct 07 '11 23:10

Guy Sensei


People also ask

Why is Lucene so fast?

Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, datasources structure the data as an object or record, which in turn have fields and values.

How does Lucene index work?

In Lucene, a Document is the unit of search and index. An index consists of one or more Documents. Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

What kind of index does Lucene use?

The Inverted Index is the basic data structure used by Lucene to provide Search in a corpus of documents.

What is Apache Lucene used for?

Apache Lucene™ is a high-performance, full-featured search engine library written entirely in Java. It is a technology suitable for nearly any application that requires structured search, full-text search, faceting, nearest-neighbor search across high-dimensionality vectors, spell correction or query suggestions.


3 Answers

  1. Indexes contains sorted documents. when you query with and operator(term1 AND term2) it use two iterators so when you know that first term1 starts with docN, you can skip all document for term2 till docN. So there are not only iterator with next method, but very efficient skipTo method. It is implemented with Skip list index(http://en.wikipedia.org/wiki/Skip_list). So by using method next and skipTo we iterate very fast over large chunks, and as data is sparse(those will not work for usual database for example) it very efficient.
  2. Other point that lucene hold only N best so it is much faster than sort all scores document. If you request 10 best it is twice faster than if you request 20 best documents
like image 136
yura Avatar answered Sep 21 '22 12:09

yura


Lucene will intersect sorted docs ids or use windows of bitsets, depending on the situation. See the comments at the top of BooleanScorer.

like image 38
A. Coady Avatar answered Sep 19 '22 12:09

A. Coady


Intersection is similar to a sort-merge join, except the IDs are already sorted. See this blog post for more information.

like image 25
Xodarap Avatar answered Sep 22 '22 12:09

Xodarap