I understand the concept of an inverted index and how optimizing dictionary storage can allow the entire dictionary to be loaded into main memory for faster queries.
I am trying to understand how the Lucene index works.
Suppose I have a String field with only four distinct values across the 200 billion documents indexed in Lucene. This field is a stored field.
Now suppose I change the field to a Byte or Int type to represent the four distinct values, then re-index and store all 200 billion documents.
What storage and query optimizations, if any, would this data type change bring?
Please suggest a test I could run on my laptop to get a sense of the difference.
As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.
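To make the field-value model concrete, here is a minimal Python sketch. It is not Lucene code: the `analyze` function is a hypothetical stand-in for an analyzer, and the field names and values are invented for illustration. It only shows how a field can carry several values and how analysis can turn one string into several terms.

```python
from collections import defaultdict

def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()

# A document modelled as a plain list of (field, value) pairs.
# The same field name may appear more than once (multi-valued field).
doc = [
    ("status", "ACTIVE"),
    ("tag", "search"),
    ("tag", "index"),
    ("title", "Quick Brown Fox"),
]

def terms_per_field(document, analyzed_fields):
    """Collect the terms each field would contribute to the index."""
    terms = defaultdict(list)
    for field, value in document:
        if field in analyzed_fields:
            # One string value becomes several terms after analysis.
            terms[field].extend(analyze(value))
        else:
            # Non-analyzed values are kept as a single term.
            terms[field].append(value)
    return dict(terms)

indexed = terms_per_field(doc, analyzed_fields={"title"})
print(indexed)
```

Here `tag` ends up with two terms because the field appears twice, while `title` ends up with three because the single value was split by the stand-in analyzer.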
Lucene doesn’t care if the values are strings or numbers or dates. All values are just treated as opaque bytes.
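As a rough laptop-scale illustration of the "opaque bytes" point, the sketch below builds a toy inverted index twice, once with string terms and once with one-byte terms. It is a simplified model and not Lucene's on-disk format; the four labels and the document count are assumptions made up for the example. With only four distinct values, the term dictionary is tiny either way and the number of postings is identical, so the main difference is the raw size of the stored values.

```python
import struct
from collections import defaultdict

def build_index(values, encode):
    """Toy inverted index: term bytes -> list of document ids."""
    postings = defaultdict(list)
    for doc_id, value in enumerate(values):
        postings[encode(value)].append(doc_id)
    return postings

# Hypothetical four distinct values, spread evenly over 1000 documents.
labels = ["PENDING", "ACTIVE", "SUSPENDED", "CLOSED"]
docs = [labels[i % 4] for i in range(1000)]

# Same data, two encodings of the term: UTF-8 string vs. a single byte.
as_string = build_index(docs, lambda v: v.encode("utf-8"))
as_byte = build_index(docs, lambda v: struct.pack("B", labels.index(v)))

def dictionary_bytes(index):
    """Total size of the distinct terms (the 'dictionary')."""
    return sum(len(term) for term in index)

def posting_count(index):
    """Total number of (term, doc) entries across all postings lists."""
    return sum(len(p) for p in index.values())

# Raw, uncompressed size of the stored values under each encoding.
raw_string = sum(len(v.encode("utf-8")) for v in docs)
raw_byte = len(docs)

print("dictionary bytes:", dictionary_bytes(as_string), "vs", dictionary_bytes(as_byte))
print("postings:", posting_count(as_string), "vs", posting_count(as_byte))
print("raw stored bytes:", raw_string, "vs", raw_byte)
```

Lucene compresses stored fields in blocks, so the gap on disk is likely to be smaller than the raw ratio this toy model suggests. To get real numbers, index the same sample of documents both ways and compare the sizes of the two index directories.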
For more information, please see this document.