 

Should an index be optimised after incremental indexes in Lucene?

We run full re-indexes every 7 days (i.e. creating the index from scratch) on our Lucene index and incremental indexes every 2 hours or so. Our index has around 700,000 documents and a full index takes around 17 hours (which isn't a problem).

When we do incremental indexes, we only index content that has changed in the past two hours, so it takes much less time - around half an hour. However, we've noticed that a lot of this time (maybe 10 minutes) is spent running the IndexWriter.optimize() method.

The LuceneFAQ mentions that:

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

...but this doesn't seem to give any definition for what "frequently" means. Optimizing is CPU intensive and VERY IO-intensive, so we'd rather not be doing it if we can get away with it. How much is the hit of running queries on an un-optimized index (I'm thinking especially in terms of query performance after a full re-index compared to after 20 incremental indexes where, say, 50,000 documents have changed)? Should we be optimising after every incremental index or is the performance hit not worth it?

asked Sep 23 '08 by Mat Mannion

People also ask

What is incremental indexing?

An incremental index takes only seconds to perform and is useful on large capacity websites that can take many hours to completely index. When you generate an incremental index, status information is displayed, such as start time, elapsed time, and errors during the indexing process.

How Lucene works internally?

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered zero, and each subsequent document added gets a number one greater than the previous. Note that a document's number may change, so caution should be taken when storing these numbers outside of Lucene.

Why is Lucene so fast?

Lucene is very fast at searching because of its inverted index. Most data sources structure data as objects or records, each of which has fields and values; an inverted index instead maps each term to the list of documents that contain it, so a query only has to look up its terms rather than scan every record.


3 Answers

Mat, since you seem to have a good idea how long your current process takes, I suggest that you remove the optimize() and measure the impact.

Do many of the documents change in those 2-hour windows? If only a small fraction (50,000 out of 700,000 is about 7%) is incrementally re-indexed, then I don't think you are getting much value out of an optimize().

Some ideas:

  • Don't do an incremental optimize() at all. My experience says you are not seeing a huge query improvement anyway.
  • Do the optimize() daily instead of 2-hourly.
  • Do the optimize() during low-volume times (which is what the javadoc says).

And make sure you take measurements. These kinds of changes can be a shot in the dark without them.
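The "measure, then decide" advice above can be sketched as a small policy helper. The thresholds here are illustrative assumptions to tune from your own measurements, not anything Lucene provides:

```java
class OptimizePolicy {
    // Hypothetical thresholds -- tune these from your own measurements.
    static final double CHANGED_FRACTION_THRESHOLD = 0.25;
    static final int SEGMENT_COUNT_THRESHOLD = 20;

    /** Optimize only when enough of the index has churned or segments pile up. */
    static boolean shouldOptimize(long changedDocs, long totalDocs, int segmentCount) {
        double changedFraction = (double) changedDocs / totalDocs;
        return changedFraction >= CHANGED_FRACTION_THRESHOLD
                || segmentCount >= SEGMENT_COUNT_THRESHOLD;
    }

    public static void main(String[] args) {
        // 50,000 of 700,000 changed (~7%), only 5 segments: skip the optimize.
        System.out.println(shouldOptimize(50_000, 700_000, 5));  // false
        // Segments have piled up past the threshold: optimize now.
        System.out.println(shouldOptimize(50_000, 700_000, 25)); // true
    }
}
```

Running the optimize() inside such a guard, rather than unconditionally after every incremental pass, is one way to act on the measurements.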

answered Oct 21 '22 by Matt Quail


An optimize operation reads and writes the entire index, which is why it's so IO intensive!

The idea behind optimize operations is to re-combine all the various segments in the Lucene index into one single segment, which can greatly reduce query times as you don't have to open and search several files per query. If you're using the normal Lucene index file structure (rather than the compound file format), you get a new segment per commit operation; the same as your re-indexes, I assume?

I think Matt has great advice and I'd second everything he says - be driven by the data you have. I would actually go a step further and only optimize a) when you need to and b) when you have low query volume.

As query performance is intimately tied to the number of segments in your index, a simple ls -1 index/segments_* | wc -l could be a useful indicator of when optimization is really needed.

Alternatively, tracking query performance and volume, and kicking off an optimize when you reach unacceptably low performance with acceptably low volume, would be a nicer solution.
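That latency-and-volume trigger can be sketched as a tiny plain-Java helper; the class name and both thresholds are assumptions for illustration, not part of Lucene's API:

```java
/** Decides when to kick off an optimize, based on measured query stats. */
class OptimizeTrigger {
    // Assumed thresholds -- pick values that match your own SLAs and traffic.
    private final double maxAcceptableLatencyMs;
    private final double lowVolumeQps;

    OptimizeTrigger(double maxAcceptableLatencyMs, double lowVolumeQps) {
        this.maxAcceptableLatencyMs = maxAcceptableLatencyMs;
        this.lowVolumeQps = lowVolumeQps;
    }

    /** Optimize only when queries are too slow AND traffic is low enough to absorb the IO hit. */
    boolean shouldOptimize(double avgLatencyMs, double currentQps) {
        return avgLatencyMs > maxAcceptableLatencyMs && currentQps < lowVolumeQps;
    }
}
```

You would feed this from whatever latency and queries-per-second metrics you already collect, and call IndexWriter.optimize() only when it returns true.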

answered Oct 21 '22 by James Brady


In this mail, Otis Gospodnetic advises against using optimize if your index sees constant updates. It's from 2007, but calling optimize() is by its very nature an IO-heavy operation. You could consider a more gradual approach instead, such as tuning the MergeScheduler and merge policy.
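To see why merging spreads the work out, here is a toy simulation of Lucene-style logarithmic merging. This is not Lucene's real merge policy, just the general idea, with an assumed mergeFactor of 10:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of logarithmic segment merging (assumed mergeFactor of 10). */
class MergeSimulation {
    static final int MERGE_FACTOR = 10;

    /** How many segments remain after the given number of commits, with no optimize(). */
    static int segmentsAfter(int commits) {
        List<Integer> segments = new ArrayList<>(); // each entry = doc count of one segment
        for (int i = 0; i < commits; i++) {
            segments.add(1);            // each commit flushes one new small segment
            mergeEqualSized(segments);  // background merging kicks in as segments accumulate
        }
        return segments.size();
    }

    // Merge whenever MERGE_FACTOR segments of the same size have accumulated.
    private static void mergeEqualSized(List<Integer> segments) {
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int size : new ArrayList<>(segments)) {
                if (Collections.frequency(segments, size) >= MERGE_FACTOR) {
                    for (int k = 0; k < MERGE_FACTOR; k++) {
                        segments.remove(Integer.valueOf(size));
                    }
                    segments.add(size * MERGE_FACTOR);
                    merged = true;
                    break;
                }
            }
        }
    }
}
```

Under this model, 55 commits leave you with 10 live segments, while 100 commits collapse all the way down to 1 - merging alone keeps the segment count logarithmic in the number of commits, which is why frequent optimize() calls buy relatively little.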

answered Oct 21 '22 by Steen