
SOLR index size reduction

We have some massive SOLR indices for a large project, and they are consuming more than 50 GB of space.

We have considered several ways to reduce the size that involve changing the content of the indices, but I am curious whether there are any changes we can make to a SOLR index that would reduce its size by 2 orders of magnitude or more, and that come down to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set correctly.

Another relevant question: (3) is there a way to trade index size for performance inside SOLR, and if so, how would it work?

Any thoughts on this would be appreciated... Thanks!

jayunit100 asked Apr 09 '12 22:04



1 Answer

There are a couple of things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint) field, because a tint field indexes each value at several precisions, but range queries will be slower when using an int.
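As a rough sketch of that trade-off, the example schema.xml shipped with Solr 3.x defines the two types approximately as shown here. Treat this as an assumption about your setup: your own schema may define int and tint differently, so check it before relying on these values.

```xml
<!-- precisionStep="0": one indexed term per value; smallest index, slower range queries -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0"
           omitNorms="true" positionIncrementGap="0"/>

<!-- precisionStep="8": extra lower-precision terms per value; larger index, faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
```

Switching a field from tint to int only pays off if you rarely run range queries on it, and the change takes effect only after the documents are reindexed.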

To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.

  • Are you using a lot of stored fields? If so, try removing the stored fields from the index and querying your database for the necessary data once you've got the results back from Solr.
  • Add omitNorms="true" to text fields that don't need length normalization.
  • Add omitPositions="true" to text fields that don't require phrase matching.
  • Special fields, like NGrams, can take up a lot of space, so check whether you really need them.
  • Are you removing stop words from text fields? If not, doing so will shrink the index.
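A minimal schema.xml sketch of the suggestions above might look like the following. The field names (body, tags) and the text_small type are hypothetical, the string type and stopwords.txt file are assumed to exist as in Solr's example configuration, and omitPositions is only available if your Solr version supports it.

```xml
<!-- Hypothetical text type that drops stop words at index and query time -->
<fieldType name="text_small" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Keep only the key stored; fetch everything else from the database after searching -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<!-- Searchable but not stored, no length normalization -->
<field name="body" type="text_small" indexed="true" stored="false" omitNorms="true"/>

<!-- No phrase matching needed here, so positions are dropped as well -->
<field name="tags" type="text_small" indexed="true" stored="false"
       omitNorms="true" omitPositions="true"/>
```

None of these settings shrink an existing index on their own; the documents have to be reindexed after the schema change before the size goes down.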
Nick Clark answered Nov 10 '22 02:11