
How much space and processing time is saved in a Lucene index by storing a field as a Byte instead of a String for billions of documents?

I understand the concept of an inverted index and how optimizing dictionary storage can help load the entire dictionary into main memory for faster queries.

I am trying to understand how a Lucene index works.

Suppose I have a String field that takes only four distinct values across the 200 billion documents indexed in Lucene. This is a stored field.

Now suppose I change the field to a Byte or Int type that represents the same four distinct values, then re-index and store all 200 billion documents.

What storage and query-time savings, if any, would this data type change bring?

Please also suggest a test I could run on my laptop to get a sense of the difference.

Watt asked Apr 11 '18 00:04


1 Answer

As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.

Lucene doesn’t care if the values are strings or numbers or dates. All values are just treated as opaque bytes.
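If the values are just bytes, a rough first estimate is simply the number of bytes per value times the number of documents. The sketch below does that arithmetic in Python. The four labels are hypothetical (the question does not say what the values are), the distribution is assumed to be uniform, and the result ignores any per-value overhead and any compression Lucene applies, so treat it as an upper bound on the raw difference rather than a prediction of index size.

```python
# Back-of-the-envelope estimate of raw (uncompressed) stored-value bytes.
DOC_COUNT = 200_000_000_000  # 200 billion documents, from the question

# Hypothetical labels: the question only says there are four distinct values.
labels = ["PENDING", "ACTIVE", "SUSPENDED", "CLOSED"]


def raw_bytes(doc_count, bytes_per_value):
    """Total payload bytes if every document stores one value of this size."""
    return doc_count * bytes_per_value


# Average UTF-8 length of the labels, assuming each is equally likely.
avg_label_bytes = sum(len(s.encode("utf-8")) for s in labels) / len(labels)

string_total = raw_bytes(DOC_COUNT, avg_label_bytes)
byte_total = raw_bytes(DOC_COUNT, 1)

print(f"average label size: {avg_label_bytes} bytes")
print(f"as strings: {string_total / 1e12:.2f} TB raw")
print(f"as bytes:   {byte_total / 1e12:.2f} TB raw")
```

With labels of a different length the gap scales accordingly; the single-byte figure stays at one byte per document regardless.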

For more information, please see this document.
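For a laptop-scale experiment of the kind the question asks about, one option is to compare how well the two encodings compress, since as far as I know Lucene compresses stored fields in blocks. The sketch below is only a stand-in: it uses Python's `zlib` rather than Lucene's own codec, the labels and the sample size are hypothetical, and the length-prefixed layout is an assumption made for illustration, not Lucene's on-disk format. The absolute numbers will not match a real index; the point is to see how much of the raw gap survives compression.

```python
import random
import zlib

random.seed(42)  # fixed seed so repeated runs use the same sample

# Hypothetical labels: the question only says there are four distinct values.
LABELS = ["PENDING", "ACTIVE", "SUSPENDED", "CLOSED"]
SAMPLE_DOCS = 100_000  # small enough to run quickly on a laptop

# Draw one of the four values per document, uniformly at random.
codes = [random.randrange(len(LABELS)) for _ in range(SAMPLE_DOCS)]

# Encoding A: each value as a length-prefixed UTF-8 string (illustrative layout).
as_strings = b"".join(
    bytes([len(LABELS[c])]) + LABELS[c].encode("utf-8") for c in codes
)

# Encoding B: each value as a single byte in the range 0..3.
as_bytes = bytes(codes)


def report(name, payload):
    """Print raw and zlib-compressed sizes; return the compressed size."""
    compressed = zlib.compress(payload, 6)
    print(f"{name}: raw={len(payload)} B, compressed={len(compressed)} B")
    return len(compressed)


string_compressed = report("string field", as_strings)
byte_compressed = report("byte field", as_bytes)
```

Real data is rarely uniform or randomly ordered, so it is worth repeating the run with a skewed or sorted sample. For figures that reflect Lucene itself, the same comparison would have to be made by indexing a sample into two actual Lucene indexes and comparing their sizes on disk.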

Ali Soltani answered Oct 20 '22 08:10