
Avoid removal of current Lucene.NET index during rebuild

I'm new to Lucene.NET, but I'm using an open source tool built for Sitecore CMS that uses Lucene.NET to index lots of content from the CMS. I confirmed yesterday that when I rebuild my indexes, the current index files are wiped clean, so anything that relies on the index gets no data for about 30-60 seconds (the time a full index rebuild takes). Is there a best practice or a way to make Lucene.NET not overwrite the current index files until the new index is completely rebuilt? I'm basically thinking I'd like it to write to new temp index files and, when the rebuild is done, have those files overwrite the current index.

Example of what I'm talking about:

  • Build fresh index (~30 seconds)
  • Index has about 500 documents
  • Use code to access data in index and display on website
  • Rebuild index (~30 seconds)
    • Any code that reads the index now returns nothing because the index files are being overwritten, so the website shows no data
  • Rebuild complete: data now available again, data back on website

Thanks in advance

Mark Ursino asked Jan 07 '11 14:01



2 Answers

I have no experience with "Sitecore" itself but here's my story.

We've recently incorporated index-based search (using Lucene.Net) into our eCommerce sub-system. In our case the index update can take about half an hour (~50,000 products plus lots of related information). To prevent "denial of service" responses during the update, we first create a "backup" version of the index (simply copying the index directory to another location), and all further requests are redirected to this "backup" version. When the update is complete, we delete the backup so that clients start using the updated (or "live") version of the index. This also helps with unhandled exceptions during the update, because otherwise you might end up with no index at all (in our case clients can always fall back to the "backup" version).
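The backup-and-redirect flow described above can be sketched at the file-system level. This is only a minimal, language-agnostic illustration in Python, not Lucene.Net code: `active_index_dir`, `rebuild_with_backup`, and the `rebuild` callable are hypothetical names, and the sketch assumes the index is a plain directory that can be copied while no writer has it open.

```python
import shutil
from pathlib import Path


def active_index_dir(live: Path, backup: Path) -> Path:
    """Readers call this: use the backup whenever one exists."""
    return backup if backup.exists() else live


def rebuild_with_backup(live: Path, backup: Path, rebuild) -> None:
    """Copy live to backup, rebuild live in place, then delete the backup.

    `rebuild` is a hypothetical callable that rewrites the index in `live`.
    If it raises, the backup stays in place and readers keep using it.
    """
    if not backup.exists():
        # Copy under a temporary name, then rename, so readers never
        # resolve to a half-copied backup directory.
        tmp = backup.with_name(backup.name + ".tmp")
        if tmp.exists():
            shutil.rmtree(tmp)
        shutil.copytree(live, tmp)
        tmp.rename(backup)
    # A backup left over from a failed rebuild is the last good copy,
    # so it is reused rather than overwritten.
    rebuild(live)           # long-running; may raise
    shutil.rmtree(backup)   # success: readers go back to the live index
```

Readers would resolve the directory through `active_index_dir` on each request (or each time they open a searcher), so the switch to and from the backup needs no other coordination in this sketch.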

The API reference (Lucene 2.4) of the Lucene.Net.Index.IndexWriter object states the following:

Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.

So at least you shouldn't worry about the clients that are currently searching within your index.

Hope this helps you make the right decision.

volpav answered Nov 05 '22 11:11



I'm not familiar with that Sitecore tool, but I can answer how you would do it with pure Lucene.Net: use a near-real-time (NRT) setup, which means "have one index writer and never close it."

Basically, an index writer holds a "virtual" index in memory until it gets flushed to disk. So as long as you get your readers from the writer, you'll always see the latest stuff, even if it hasn't been flushed to disk yet.
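As a rough mental model of that idea, here is a toy sketch in Python. It is not the Lucene.Net API: `ToyWriter`, `add`, `flush`, and `get_reader` are hypothetical names, and the "index" is just a list of strings. It only shows why a reader obtained from a long-lived writer can see documents that have not been flushed yet.

```python
class ToyWriter:
    """Toy stand-in for a long-lived index writer (not the Lucene.Net API)."""

    def __init__(self):
        self._flushed = []    # documents already "on disk"
        self._buffered = []   # documents still only in memory

    def add(self, doc):
        self._buffered.append(doc)

    def flush(self):
        # Move buffered documents to the "on disk" list.
        self._flushed.extend(self._buffered)
        self._buffered.clear()

    def get_reader(self):
        # Point-in-time snapshot of flushed plus not-yet-flushed documents.
        return tuple(self._flushed + self._buffered)


writer = ToyWriter()             # created once and kept open
writer.add("doc-1")
reader_a = writer.get_reader()   # taken before any flush, already sees doc-1
writer.add("doc-2")
reader_b = writer.get_reader()   # a fresh reader sees both documents
```

Note that `reader_a` keeps its own snapshot and does not see `doc-2`; in this model, as in the answer, callers get new data by asking the writer for a new reader.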

Xodarap answered Nov 05 '22 13:11
