
What are the space limits of a Lucene index?

Tags:

lucene

I am adding billions of rows to a Lucene index; each row is almost 6000 bytes. Is there any limit on the maximum number of rows that can be added to a Lucene index? How much space would a billion rows of 6000 bytes each occupy in a Lucene index? Is there any limit on this size?

Sravan asked Jul 05 '12 12:07




1 Answer

See the Lucene documentation for its limitations. It cannot have more than

  • ~ 274 billion distinct terms (per index segment),
  • ~ 2.1 billion documents (per index).
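
To get a feel for what the document limit means for "billions of rows", here is a rough sketch in Python. It assumes the ceiling is the largest signed 32-bit integer; the exact limit may be slightly lower depending on the Lucene version. `min_indexes_needed` is a hypothetical helper for illustration, not part of Lucene.

```python
# Assumption: the per-index document ceiling is about 2**31 - 1 (~2.1 billion).
MAX_DOCS_PER_INDEX = 2**31 - 1


def min_indexes_needed(total_docs: int, max_docs: int = MAX_DOCS_PER_INDEX) -> int:
    """Smallest number of separate indexes that can hold total_docs documents."""
    if total_docs < 0:
        raise ValueError("total_docs must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return (total_docs + max_docs - 1) // max_docs


if __name__ == "__main__":
    for docs in (1_000_000_000, 5_000_000_000):
        print(docs, "documents ->", min_indexes_needed(docs), "index(es)")
```

Under that assumption, one billion rows fit in a single index, while anything beyond roughly 2.1 billion rows has to be split across several indexes.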

For such large datasets, it is generally a good idea to use Lucene only for its inverted index and to store the actual content of the documents somewhere else. You can expect the index size to be ~ 30% of the size of the original corpus, provided these are regular documents. Computationally generated documents with a lot of unique terms would produce a much bigger index.
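
As a back-of-the-envelope check for the numbers in the question, the ~30% rule of thumb can be applied like this. This is only a sketch: the ratio is the rough figure quoted above, real indexes vary with the data and analysis settings, and `estimate_index_bytes` is a hypothetical helper name.

```python
def estimate_index_bytes(num_docs: int, bytes_per_doc: int, index_percent: int = 30):
    """Return (corpus_bytes, estimated_index_bytes).

    index_percent is the assumed index-to-corpus ratio in percent
    (~30 for regular text documents, per the rule of thumb above).
    """
    corpus_bytes = num_docs * bytes_per_doc
    # Integer arithmetic keeps the result exact for large byte counts.
    index_bytes = corpus_bytes * index_percent // 100
    return corpus_bytes, index_bytes


if __name__ == "__main__":
    corpus, index = estimate_index_bytes(1_000_000_000, 6000)
    # Decimal terabytes (1 TB = 10**12 bytes).
    print(f"corpus ~ {corpus / 1e12:.1f} TB, index ~ {index / 1e12:.1f} TB")
```

For one billion rows of 6000 bytes, this gives a corpus of about 6 TB and an index of roughly 1.8 TB, before counting any stored fields.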

jpountz answered Nov 14 '22 01:11