
How do I estimate the size of a Lucene index?

Tags:

lucene

Is there a known formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to index, the size of each field, and how many items will be indexed. Once Lucene has processed these, how does that translate into bytes?

bpapa asked Sep 15 '08 18:09



2 Answers

See the Lucene index format documentation. The major file is the compound index (the .cfs file). If you have term statistics for your data, you can probably get an estimate of the .cfs file size from them. Note that the result varies greatly with the Analyzer you use and with the field types you define.
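To see which files dominate an index you have already built, you can total the bytes per file extension in the index directory. This is only a rough sketch: the function name `index_size_by_extension` and the directory path are hypothetical, and it reads the files from disk without using any Lucene API.

```python
import os
from collections import defaultdict


def index_size_by_extension(index_dir):
    """Return {extension: total_bytes} for the files directly inside index_dir.

    Files without an extension (for example a segments file) are keyed
    by their full name instead.
    """
    sizes = defaultdict(int)
    for entry in os.scandir(index_dir):
        if entry.is_file():
            ext = os.path.splitext(entry.name)[1] or entry.name
            sizes[ext] += entry.stat().st_size
    return dict(sizes)


if __name__ == "__main__":
    # Hypothetical path; point this at a real index directory.
    index_dir = "/path/to/index"
    if os.path.isdir(index_dir):
        by_ext = index_size_by_extension(index_dir)
        # Largest contributors first.
        for ext, size in sorted(by_ext.items(), key=lambda kv: -kv[1]):
            print(f"{ext:>12}  {size} bytes")
```

Running this on a small test index built with your real Analyzer and field definitions shows how much of the total is in the .cfs file for your data.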

Yuval F answered Nov 09 '22 23:11



The index stores each distinct term ("token") only once, so the size depends on the nature of the material being indexed. Add to that whatever field values you choose to store as well. One good approach is to take a sample, index it, and extrapolate from that to the complete source collection. However, the ratio of index size to source size decreases as more documents are added, because many of the words are already in the index, so you might want to make the sample a decent percentage of the original.
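The sample-and-extrapolate approach can be sketched as follows. The function names are hypothetical, and the sizes are numbers you measure yourself after indexing a sample. The linear version assumes the index-to-source ratio stays constant, so it tends to overestimate. The power-law version is an assumption introduced here, not something Lucene guarantees: it fits index = a * source^b through two samples of different sizes to capture the shrinking ratio.

```python
import math


def extrapolate_linear(sample_source_bytes, sample_index_bytes, total_source_bytes):
    """Scale the sample's index/source ratio up to the full collection."""
    ratio = sample_index_bytes / sample_source_bytes
    return ratio * total_source_bytes


def extrapolate_power_law(small_sample, large_sample, total_source_bytes):
    """Fit index = a * source**b through two (source_bytes, index_bytes)
    samples and evaluate the fit at total_source_bytes.

    Assumes growth follows a power law, which is only an approximation.
    """
    (s1, i1), (s2, i2) = small_sample, large_sample
    b = math.log(i2 / i1) / math.log(s2 / s1)
    a = i2 / s2 ** b
    return a * total_source_bytes ** b


if __name__ == "__main__":
    # Made-up measurements for illustration only: (source bytes, index bytes).
    small = (100_000_000, 40_000_000)
    large = (400_000_000, 130_000_000)
    total = 4_000_000_000

    print("linear estimate:   ", round(extrapolate_linear(*large, total)))
    print("power-law estimate:", round(extrapolate_power_law(small, large, total)))
```

Treat both results as rough bounds rather than predictions. The larger the samples are relative to the full collection, the less the two estimates should differ.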

alchemical answered Nov 09 '22 22:11
