Speeding up Solr Indexing

Tags: solr, lucene

I am working on speeding up my Solr indexing. How many threads, if any, does Solr use for indexing by default? Is there a way to increase or decrease that number?

phanips asked Aug 24 '11 15:08


2 Answers

When you index a document, several steps are performed:

  • the document is analyzed,
  • data is put in the RAM buffer,
  • when the RAM buffer is full, data is flushed to a new segment on disk,
  • if there are more than ${mergeFactor} segments, segments are merged.
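
The four steps above can be sketched as a toy model. This is not Lucene code: `ToyIndexWriter`, `ramBufferDocs` and the "merge everything into one segment" rule are assumptions made for illustration, and the buffer is measured in documents rather than bytes. Real merge policies are more selective about which segments they merge.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the buffer -> flush -> merge pipeline (hypothetical, not Lucene's API).
class ToyIndexWriter {
    private final int ramBufferDocs;   // buffer capacity in documents (stand-in for a RAM size)
    private final int mergeFactor;     // merge once there are more segments than this
    private final List<Integer> segments = new ArrayList<>(); // doc count per on-disk segment
    private int buffered = 0;
    private int merges = 0;

    ToyIndexWriter(int ramBufferDocs, int mergeFactor) {
        if (ramBufferDocs <= 0 || mergeFactor <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.ramBufferDocs = ramBufferDocs;
        this.mergeFactor = mergeFactor;
    }

    // Steps 1 and 2: the document is analyzed and placed in the RAM buffer.
    void addDocument() {
        buffered++;
        if (buffered >= ramBufferDocs) {
            flush();
        }
    }

    // Step 3: the buffer is written out as a new segment.
    void flush() {
        if (buffered == 0) {
            return;
        }
        segments.add(buffered);
        buffered = 0;
        // Step 4: more than mergeFactor segments triggers a merge.
        if (segments.size() > mergeFactor) {
            mergeAll();
        }
    }

    // Simplification: collapse all segments into a single one.
    private void mergeAll() {
        int total = 0;
        for (int docs : segments) {
            total += docs;
        }
        segments.clear();
        segments.add(total);
        merges++;
    }

    List<Integer> segments() {
        return new ArrayList<>(segments);
    }

    int merges() {
        return merges;
    }
}
```

In this model a larger buffer means fewer flushes and a larger merge factor means fewer merges, which is why both settings tend to matter for indexing throughput.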

The first two steps will be run in as many threads as you have clients sending data to Solr, so if you want Solr to run three threads for these steps, all you need is to send data to Solr from three threads.
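
A minimal sketch of the client side of this idea, using only the JDK. `ParallelIndexer` and the `sink` parameter are hypothetical names: the sink stands in for whatever thread-safe call actually sends a document to Solr, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class ParallelIndexer {

    /**
     * Sends docs to the sink from `threads` client threads and returns the
     * number of documents sent. The sink must be safe to call concurrently.
     */
    static <D> int indexInParallel(List<D> docs, int threads, Consumer<D> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                pending.add(pool.submit(() -> {
                    // Worker t handles documents t, t + threads, t + 2 * threads, ...
                    for (int i = offset; i < docs.size(); i += threads) {
                        sink.accept(docs.get(i));
                        sent.incrementAndGet();
                    }
                }));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for the worker and surface any failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }
}
```

Calling `ParallelIndexer.indexInParallel(docs, 3, sink)` would give Solr three concurrent client requests to work on, and so three threads for the analysis and buffering steps.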

You can configure the number of threads used for the fourth step if you use a ConcurrentMergeScheduler (http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/index/ConcurrentMergeScheduler.html). However, there is no way to set the maximum number of threads from the Solr configuration files, so you need to write a custom class that calls setMaxThreadCount in its constructor.

In my experience, the main ways to improve indexing speed with Solr are:

  • buying faster hardware (especially I/O),
  • sending data to Solr from several threads (as many threads as cores is a good start),
  • using the Javabin format,
  • using faster analyzers.

Although StreamingUpdateSolrServer looks interesting for improving indexing performance, it doesn't support the Javabin format. Since Javabin parsing is much faster than XML parsing, I got better performance by sending bulk updates (800 in my case, but with rather small documents) using CommonsHttpSolrServer and the Javabin format.
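
The bulk-update part can be sketched independently of the wire format. In this minimal example, `BatchingSender` and `bulkSink` are hypothetical names: the sink stands in for a single bulk request to Solr, and the batch size of 800 is only the figure quoted above, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Collects documents and hands them to a bulk sink in fixed-size batches.
// Not thread-safe: use one instance per sending thread.
class BatchingSender<D> {
    private final int batchSize;
    private final Consumer<List<D>> bulkSink; // hypothetical stand-in for one bulk request
    private final List<D> buffer = new ArrayList<>();

    BatchingSender(int batchSize, Consumer<List<D>> bulkSink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.bulkSink = bulkSink;
    }

    // Buffers one document and sends a batch once it is full.
    void add(D doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; call once at the end for the last partial batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        bulkSink.accept(new ArrayList<>(buffer)); // copy so the sink may keep the list
        buffer.clear();
    }
}
```

With a batch size of 800, adding 2000 documents and then calling `flush()` produces three requests instead of 2000. The best size depends on document size, so it is worth measuring a few values.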

You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.

jpountz answered Sep 28 '22 10:09


This article describes an approach to scaling indexing with SolrCloud, Hadoop and Behemoth. It applies to Solr 4.0, which had not been released when this question was originally posted.

ted.strauss answered Sep 28 '22 09:09