From a data structure point of view, how does Lucene (Solr/Elasticsearch) do filtered term counts so quickly? For example, given all documents that contain the word "bacon", find the counts for all words in those documents.
First, for background, I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
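As a rough illustration of that AND step, here is a toy sketch using java.util.BitSet as a stand-in (the real Lucene structures are compressed and far more elaborate, so this only shows the idea, not the implementation):

```java
import java.util.BitSet;

public class BitSetAnd {
    public static void main(String[] args) {
        int numDocs = 8;
        BitSet red  = new BitSet(numDocs);
        BitSet blue = new BitSet(numDocs);

        // doc IDs that contain each term
        red.set(1);  red.set(3);  red.set(5);
        blue.set(3); blue.set(5); blue.set(6);

        // documents containing "red" AND "blue"
        BitSet both = (BitSet) red.clone();
        both.and(blue);

        System.out.println(both);               // {3, 5}
        System.out.println(both.cardinality()); // 2 matching documents
    }
}
```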
But how then does Lucene quickly determine counts for all words in documents that match "bacon"? In my naive understanding, Lucene would have to take the bit array associated with bacon and AND it with the bit arrays for every single other word. Am I missing something? I don't understand how this can be efficient. Also, do these bit arrays have to be pulled off of disk? That sounds all the worse!
How's the magic work?
Why is Lucene so fast? Lucene is very fast at searching because of its inverted index technique. Normally, data sources structure data as objects or records, which in turn have fields and values.
For every value of a numeric field, Lucene stores several values at different precisions, which allows it to run range queries very efficiently. For use cases that lean heavily on numeric range queries, this goes a long way toward explaining why Solr can be so much faster.
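As a side note, recent Lucene versions index numerics as points (e.g. IntPoint) rather than as terms at multiple precisions, but the effect is the same: fast range queries. A minimal sketch, assuming Lucene 8.x and a made-up "price" field:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class RangeQueryDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int price : new int[] {5, 17, 42, 99, 250}) {
                Document doc = new Document();
                doc.add(new IntPoint("price", price)); // indexed structure built for fast range queries
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // count documents with 10 <= price <= 100
            int hits = searcher.count(IntPoint.newRangeQuery("price", 10, 100));
            System.out.println(hits); // 3
        }
    }
}
```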
Elasticsearch is also an open-source search engine built on top of Apache Lucene; it forms the ELK Stack together with Logstash and Kibana.
You may already know this, but it won't hurt to say that Lucene uses inverted indexing. In this indexing technique, a dictionary of every word occurring across all documents is built, and against each word, information about that word's occurrences is stored, something like this image.
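Here is a toy version of that dictionary, built as a plain Map from term to a postings list of doc IDs (the documents and field are made up for illustration; Lucene's real dictionary and postings are heavily encoded on disk):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = {
            "crispy bacon and eggs",        // doc 0
            "red and blue paint",           // doc 1
            "bacon makes red beans better"  // doc 2
        };

        // dictionary: term -> sorted list of doc IDs containing it (the postings list)
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split("\\s+")) {
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId); // avoid duplicates when a term repeats within one document
                }
            }
        }

        System.out.println(index.get("bacon")); // [0, 2]
        System.out.println(index.get("red"));   // [1, 2]
    }
}
```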
To achieve this, Lucene stores documents, indexes, and their metadata in different file formats. See this link for the file details: http://lucene.apache.org/core/3_0_3/fileformats.html#Overview
If you read the Document Numbers section, you'll see that each document is given an internal ID, so when documents are found containing the word 'consign', the Lucene engine has a reference to their metadata.
Refer to the Overview section to see what data gets saved in the different Lucene index files.
Now that we have a pointer to the stored documents, Lucene may be retrieving them in one of the following ways.
Finally, which API are you using to "quickly determine counts for all words"?
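For illustration, here is one way such counts could be computed directly against Lucene's per-segment APIs (Lucene 8.x assumed; the field name "body" is made up). This is essentially the "AND bacon's postings with every other term's postings" idea from the question spelled out; the actual faceting and terms-aggregation code in Solr/Elasticsearch typically relies on doc values or field caches rather than walking every term's postings like this:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

public class FilteredTermCounts {
    /** For every term in "body", count how many documents containing "bacon" also contain it. */
    public static Map<String, Integer> countTerms(IndexReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (LeafReaderContext ctx : reader.leaves()) {
            Terms terms = ctx.reader().terms("body");
            if (terms == null) continue;

            // 1. Materialize the matching docs for "bacon" in this segment as a bitset.
            FixedBitSet baconDocs = new FixedBitSet(ctx.reader().maxDoc());
            TermsEnum te = terms.iterator();
            if (te.seekExact(new BytesRef("bacon"))) {
                PostingsEnum pe = te.postings(null, PostingsEnum.NONE);
                for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
                    baconDocs.set(doc);
                }
            }

            // 2. Walk every term and intersect its postings with that bitset.
            te = terms.iterator();
            for (BytesRef term = te.next(); term != null; term = te.next()) {
                PostingsEnum pe = te.postings(null, PostingsEnum.NONE);
                int count = 0;
                for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
                    if (baconDocs.get(doc)) count++;
                }
                if (count > 0) counts.merge(term.utf8ToString(), count, Integer::sum);
            }
        }
        return counts;
    }
}
```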
Image credit to http://leanjavaengineering.wordpress.com/
The index file format is described here: http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description
There are no bitsets involved: it's an inverted index. Each term maps to a list of documents. In Lucene the algorithms work on iterators over these "lists", so items from the iterator are read on demand, not all at once.
This diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
Behind the scenes, Lucene works much like that diagram. Our lists are delta-encoded and bit-packed, and augmented with a skip list, which lets us intersect more efficiently (via the additional advance() operation) than the algorithm above.
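A tiny sketch of that kind of conjunction over two sorted doc-ID lists (plain Java arrays here, not Lucene's real encoded postings; advance() is simulated with a linear scan, which a skip list would make sub-linear):

```java
import java.util.ArrayList;
import java.util.List;

public class Intersect {
    /** Returns doc IDs present in both sorted postings lists. */
    static List<Integer> intersect(int[] red, int[] blue) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < red.length && j < blue.length) {
            if (red[i] == blue[j]) {             // match: emit and move both cursors
                result.add(red[i]);
                i++; j++;
            } else if (red[i] < blue[j]) {
                i = advance(red, i, blue[j]);     // skip ahead to the first doc >= blue[j]
            } else {
                j = advance(blue, j, red[i]);     // skip ahead to the first doc >= red[i]
            }
        }
        return result;
    }

    /** Linear stand-in for advance(target); a skip list makes this jump instead of scan. */
    static int advance(int[] postings, int from, int target) {
        while (from < postings.length && postings[from] < target) from++;
        return from;
    }

    public static void main(String[] args) {
        int[] red  = {1, 4, 7, 12, 30, 31};
        int[] blue = {4, 12, 13, 31, 55};
        System.out.println(intersect(red, blue)); // [4, 12, 31]
    }
}
```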
DocIdSetIterator is this "enumerator" in Lucene. It has two main methods, nextDoc() and advance(). And yes, it is true that you can decide to read the entire list, convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.
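For example, Lucene's BitSetIterator exposes an in-memory FixedBitSet through the DocIdSetIterator contract; a small sketch, assuming a recent Lucene version:

```java
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

public class IteratorOverBitSet {
    public static void main(String[] args) throws Exception {
        // An in-memory bitset of matching doc IDs (what a cached filter might hold).
        FixedBitSet bits = new FixedBitSet(64);
        bits.set(3); bits.set(17); bits.set(42);

        // Expose it through the same iterator contract the query algorithms consume.
        DocIdSetIterator it = new BitSetIterator(bits, bits.cardinality());
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println("matched doc " + doc); // prints 3, 17, 42
        }
    }
}
```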