
Lucene Fields vs. DocValues

Tags:

solr

lucene

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.

So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField etc.; the types seem to have changed in Lucene 5.0)?

First, why can't I access DocValues using document.get(fieldname)? If that's not possible, how can I access them?

Second, I've seen that some features have changed in Lucene 5.0, for example sorting can only be done on DocValues... why is that?

Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...

Also, and perhaps most important, when should I use DocValues and when regular fields?

Joseph

Yossi Vainshtein asked Mar 10 '15 09:03




1 Answer

Most of these questions are quickly answered by the Solr Wiki or a web search, but here is the gist of DocValues: they're useful for everything a modern search service does besides the actual searching. From the Solr Community Wiki:

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

...

DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.

The reason is that the storage format is turned around from the standard inverted index. The inverted index maps each term to the documents that contain it, so to gather values for these operations the application previously had to un-invert the index in memory (the fieldCache) before it could find a document's value. With DocValues it can look up the value for a given document directly, which is very useful when you already have a list of matching documents and need their values to sort, facet, or group them.
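To make the two layouts concrete, here is a toy model in plain Java. It is only a sketch and not Lucene's implementation: the class `ColumnVsInverted` and its methods are hypothetical names, the "inverted index" is a simple map from value to document IDs, and the "DocValues column" is an array indexed by document ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; none of these names exist in Lucene.
class ColumnVsInverted {

    // Inverted layout: value -> IDs of the documents holding it (good for searching).
    static Map<Long, List<Integer>> invert(long[] valuesByDoc) {
        Map<Long, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < valuesByDoc.length; doc++) {
            postings.computeIfAbsent(valuesByDoc[doc], v -> new ArrayList<>()).add(doc);
        }
        return postings;
    }

    // What a fieldCache-style approach must do before sorting: walk every
    // posting list once to rebuild the doc -> value column in memory.
    static long[] uninvert(Map<Long, List<Integer>> postings, int maxDoc) {
        long[] column = new long[maxDoc];
        for (Map.Entry<Long, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                column[doc] = e.getKey();
            }
        }
        return column;
    }

    // With the doc -> value column already on hand, sorting a hit list
    // needs just one array lookup per document.
    static int[] sortHits(int[] hits, long[] column) {
        return Arrays.stream(hits)
                .boxed()
                .sorted(Comparator.comparingLong(doc -> column[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        long[] priceByDoc = {30L, 10L, 20L, 10L}; // array index is the document ID
        int[] hits = {0, 2, 3};                   // documents matched by some query
        System.out.println(Arrays.toString(sortHits(hits, priceByDoc))); // [3, 2, 0]
    }
}
```

The point of the sketch is that `sortHits` never touches the posting lists. The column is built once at index time, so the `uninvert` step is no longer paid when a query needs to sort.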

If I remember correctly, updating a DocValues field only requires writing new values for that one field, because each document's value lives in its own slot of the column. Changing a regular indexed field would instead affect loads of dependent structures, which is why deleting and re-adding the whole document is the only viable strategy there.
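The difference in update cost can be sketched with the same kind of toy structures. This is only an illustration with hypothetical names (`UpdateSketch`, `updateColumn`, `updateInverted`). Lucene's segments are write-once, so it does not patch posting lists in place like this. In Lucene itself the relevant call is `IndexWriter.updateNumericDocValue`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these names exist in Lucene.
class UpdateSketch {

    // Column layout: the new value overwrites one slot, nothing else moves.
    static void updateColumn(long[] column, int doc, long newValue) {
        column[doc] = newValue;
    }

    // Inverted layout: the document has to leave the old value's posting
    // list and join the posting list of the new value.
    static void updateInverted(Map<Long, List<Integer>> postings,
                               int doc, long oldValue, long newValue) {
        List<Integer> old = postings.get(oldValue);
        if (old != null) {
            old.remove(Integer.valueOf(doc)); // remove by object, not by list index
            if (old.isEmpty()) {
                postings.remove(oldValue);
            }
        }
        postings.computeIfAbsent(newValue, v -> new ArrayList<>()).add(doc);
    }

    public static void main(String[] args) {
        long[] column = {30L, 10L, 20L};
        updateColumn(column, 1, 25L);
        System.out.println(column[1]); // 25

        Map<Long, List<Integer>> postings = new HashMap<>();
        postings.computeIfAbsent(10L, v -> new ArrayList<>()).add(1);
        updateInverted(postings, 1, 10L, 25L);
        System.out.println(postings); // {25=[1]}
    }
}
```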

Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.

MatsLindh answered Oct 21 '22 09:10