
What are docValues in Solr? When should I use them?

Tags:

solr

lucene

I have read multiple sources that try to explain what docValues are in Solr, but I still don't understand when I should use them, especially in relation to indexed vs. stored fields. Can anyone shed some light on this?

gravetii asked Aug 20 '18 07:08



3 Answers

What are docValues in Solr?

DocValues can be described as Lucene's column-stride field value storage or, more simply, as an uninverted index (a forward index).

To illustrate with JSON:

  • row-oriented (stored fields)

    
    {
    "doc1": {"A":1, "B":2, "C":3},
    "doc2": {"A":2, "B":3, "C":4},
    "doc3": {"A":4, "B":3, "C":2}
    }
    
  • column-oriented (docValues)

    
    {
    "A": {"doc1":1, "doc2":2, "doc3":4},
    "B": {"doc1":2, "doc2":3, "doc3":3},
    "C": {"doc1":3, "doc2":4, "doc3":2}
    }
    
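The two layouts above can be sketched in plain Java. This is only a toy model built on `java.util` maps, not how Lucene stores anything on disk; the class and method names (`ColumnLayoutSketch`, `toColumns`, `exampleRows`) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnLayoutSketch {

    /** Transposes doc -> (field -> value) into field -> (doc -> value). */
    public static Map<String, Map<String, Integer>> toColumns(
            Map<String, Map<String, Integer>> rows) {
        Map<String, Map<String, Integer>> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> row : rows.entrySet()) {
            String docId = row.getKey();
            for (Map.Entry<String, Integer> field : row.getValue().entrySet()) {
                columns.computeIfAbsent(field.getKey(), k -> new LinkedHashMap<>())
                       .put(docId, field.getValue());
            }
        }
        return columns;
    }

    /** Builds the three example documents from the illustration (hypothetical data). */
    public static Map<String, Map<String, Integer>> exampleRows() {
        Map<String, Map<String, Integer>> rows = new LinkedHashMap<>();
        rows.put("doc1", row(1, 2, 3));
        rows.put("doc2", row(2, 3, 4));
        rows.put("doc3", row(4, 3, 2));
        return rows;
    }

    private static Map<String, Integer> row(int a, int b, int c) {
        Map<String, Integer> r = new LinkedHashMap<>();
        r.put("A", a);
        r.put("B", b);
        r.put("C", c);
        return r;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> columns = toColumns(exampleRows());
        // Reading every value of field A touches a single column only.
        System.out.println(columns.get("A")); // prints {doc1=1, doc2=2, doc3=4}
    }
}
```

In the row layout, collecting all values of `A` means visiting every document; in the column layout it is a single lookup followed by a sequential read.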

Purpose of DocValues?

Stored fields store all field values for one document together in a row-stride fashion. In retrieval, all field values are returned at once per document, so that loading the relevant information about a document is very fast.

However, if you need to scan a single field across many documents (for faceting/sorting/grouping), stored fields are slow: you have to iterate through all the documents and load each document's fields on every iteration, which results in disk seeks.

For example, when sorting, once all the matching documents are found, Lucene needs to get the value of the sort field for each of them. Similarly, the faceting engine must look up each term that appears in each document that will make up the result set and pull the document IDs in order to build the facet list.

Now this problem can be approached in two ways:

  • Using the existing indexed fields. In that case, when you start sorting/aggregating on a given field, the inverted index is lazily un-inverted and put into a fieldCache at search time so that you can access values given a document ID. This process is very CPU and I/O intensive.
  • Using docValues. They are quite fast to access at search time, since they are stored column-stride such that only the value for that one field needs to be decoded per hit. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

Like the inverted index, docValues are serialized to disk, so we can rely on the OS's file system cache to manage memory instead of retaining structures on the JVM heap.
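To make "un-inverting" concrete, here is a minimal sketch of turning a term -> docIDs mapping into a docID -> value array and then sorting hits with it. It is a toy model, not Lucene's fieldCache code: it assumes a single-valued field, small in-memory data, and hypothetical names (`UninvertSketch`, `uninvert`, `sortHits`). The point is that this forward array is what has to be built at search time without docValues, whereas docValues write the equivalent structure at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UninvertSketch {

    /**
     * Un-inverts a term -> docIDs postings map into a docID -> term array.
     * Assumes a single-valued field; documents without a value stay null.
     */
    public static String[] uninvert(Map<String, int[]> postings, int maxDoc) {
        String[] forward = new String[maxDoc];
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            for (int docId : e.getValue()) {
                forward[docId] = e.getKey();
            }
        }
        return forward;
    }

    /** Sorts matching docIDs by field value; assumes every hit has a value. */
    public static List<Integer> sortHits(List<Integer> hits, String[] forward) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Integer docId) -> forward[docId]));
        return sorted;
    }

    /** Hypothetical postings for one field across four documents. */
    public static Map<String, int[]> examplePostings() {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {2});
        postings.put("kiwi", new int[] {0, 3});
        postings.put("pear", new int[] {1});
        return postings;
    }

    public static void main(String[] args) {
        String[] forward = uninvert(examplePostings(), 4);
        System.out.println(Arrays.toString(forward));                   // [kiwi, pear, apple, kiwi]
        System.out.println(sortHits(Arrays.asList(0, 1, 2), forward));  // [2, 0, 1]
    }
}
```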

When should I use them?

For all the reasons discussed above. If you are in a low-memory environment, or you don't need to index a field, docValues are perfect for faceting/grouping/filtering/sorting/function queries. They also have the potential to increase the number of fields you can facet/group/filter/sort on without increasing your memory requirements. I've been using docValues in production Solr for sorting and faceting and have seen a huge improvement in the performance of these queries.
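As a rough sketch of what this looks like in a schema: the field names below are hypothetical, and `string` and `pint` are assumed to be field types already defined in your schema (they are in Solr's default configset for recent versions). Changing `docValues` on an existing field requires reindexing.

```xml
<!-- Hypothetical field used only for faceting/sorting: no inverted index, no stored copy -->
<field name="category" type="string" indexed="false" stored="false" docValues="true"/>

<!-- Hypothetical field that is searched on and also sorted on -->
<field name="price" type="pint" indexed="true" stored="true" docValues="true"/>
```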

kpahwa answered Oct 19 '22 00:10


Use cases of docValues are already explained by @Persimmonium and are pretty clear. They are good for faceting, sorting and such fancy stuff in the IR world.

What are docValues and why are they there? DocValues are simply a way to build a forward index, so that documents point to values. They were built to overcome the limitations of the FieldCache by providing a document-to-value mapping that is built at index time. Values are stored in a column-based fashion, and all the heavy lifting is done during document indexing.

What docvalues are:

NRT-compatible: These are per-segment data structures built at index time and designed to be efficient for the use case where data is changing rapidly.

Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.

Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.

Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.

What docvalues are not:

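Not a replacement for stored fields: These are unrelated to stored fields in every way; they are instead data structures for search (sort/facet/group/join/scoring).

As an example from the Lucene side, the following snippet shows how the docValues type of a field determines how the set of documents that have a value is computed: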

    public Bits getDocsWithField(FieldInfo field) throws IOException {
      switch(field.getDocValuesType()) {
        case SORTED_SET:
          return DocValues.docsWithValue(getSortedSet(field), maxDoc);
        case SORTED_NUMERIC:
          return DocValues.docsWithValue(getSortedNumeric(field), maxDoc);
        case SORTED:
          return DocValues.docsWithValue(getSorted(field), maxDoc);
        case BINARY:
          BinaryEntry be = binaries.get(field.number);
          return getMissingBits(be.missingOffset);
        case NUMERIC:
          NumericEntry ne = numerics.get(field.number);
          return getMissingBits(ne.missingOffset);
        default:
          throw new AssertionError();
      }
    }

Prakhar Nigam answered Oct 19 '22 00:10


Due to the way docValues are stored and accessed, they speed up some operations, such as sorting and faceting.

Besides, they are mandatory for using some features: streaming expressions, in-place updates...

So, if in doubt:

  1. if you don't have a big index, and size is not a problem, just enable them
  2. if you do have a huge index, or indexing performance is critical, look into them more carefully and pick which fields to enable them on

Persimmonium answered Oct 18 '22 22:10