Best practice for ensuring Solr/Lucene index is "up to date" after long rebuild

We have a general question about best practice/programming during a long index rebuild. This question is not Solr-specific and could just as well apply to raw Lucene or any other similar indexing tool/library/black box.

The question

What is the best practice for ensuring a Solr/Lucene index is "absolutely up to date" after a long index rebuild? That is, if users add, change, or delete db records or files (PDFs) during the course of a 12-hour index rebuild, how do you ensure the rebuilt index "includes" these changes at the very end?

Context

  • Large database and filesystem (e.g. PDFs) indexed in Solr
  • Multi-core Solr instance, where core0 serves “search” and all adds/changes/deletes, and core1 is used for the “rebuild.” Core1 is a “temporary” core.
  • At the end of the rebuild we “move” core1 to core0, so searches and updates go against the freshly rebuilt index (a sketch of that swap follows this list)
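
For reference, a minimal sketch of the “move” step, assuming the stock CoreAdmin SWAP action and a default localhost Solr URL (the URL and core names are assumptions about the setup, not part of the question):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CoreSwap {
        public static void main(String[] args) throws Exception {
            // Swap core0 and core1 so the freshly rebuilt index starts serving searches.
            // URL and core names are assumptions; adjust to the actual deployment.
            String url = "http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core1";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Swap response status: " + response.statusCode());
        }
    }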

Current Approach

  • The rebuild process queries the db and/or traverses the filesystem for “all db records” or “all files”
  • The rebuild will “get” new db records/PDFs if they appear at the end of the query/filesystem traversal. (E.g. the query is “select * from element order by element_id”. If we keep the result set open, i.e. stream it rather than building one big list up front, the result set will include entries added at the end. Similarly, if new files are added “at the end” (a new folder or new file), the file traversal will include them. A sketch of this streaming loop follows the list.)
  • The rebuild will not “get” changes or deletions to db records/documents that the rebuild process has already processed (i.e. already reindexed)
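
A minimal sketch of that streaming rebuild loop, assuming a JDBC source and the SolrJ client; the Solr URL, the JDBC connection string, and the title column are illustrative assumptions:

    import java.sql.*;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class RebuildIndexer {
        public static void main(String[] args) throws Exception {
            // core1 is the "rebuild" core; URL and credentials are assumptions.
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();
                 Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/app", "user", "pass");
                 Statement stmt = conn.createStatement()) {

                // Keep the result set open and stream it rather than loading all rows at once,
                // so records added "at the end" during the rebuild are still picked up.
                stmt.setFetchSize(500);
                try (ResultSet rs = stmt.executeQuery("select * from element order by element_id")) {
                    while (rs.next()) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", rs.getLong("element_id"));
                        doc.addField("title", rs.getString("title")); // hypothetical column
                        solr.add(doc);
                    }
                }
                solr.commit();
            }
        }
    }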

Proposed Approach

  • Track in the Solr client (i.e. via a db table) all adds/changes/deletes that occur to the db/filesystem
  • At the end of the rebuild (but before swapping the core), process these changes: i.e. delete from the index all deleted records/PDFs, and reindex all updates and additions (a sketch of this replay step follows the list)
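
A minimal sketch of that replay step, assuming a hypothetical pending_change tracking table (element_id, change_type, changed_at) that the application's add/change/delete paths populate; the table, columns, and the loadDocument helper are assumptions:

    import java.sql.*;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReplayPendingChanges {
        public static void main(String[] args) throws Exception {
            try (SolrClient rebuildCore = new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();
                 Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/app", "user", "pass");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "select element_id, change_type from pending_change order by changed_at")) {

                while (rs.next()) {
                    String id = rs.getString("element_id");
                    String type = rs.getString("change_type");
                    if ("DELETE".equals(type)) {
                        rebuildCore.deleteById(id);              // drop deleted records/PDFs from the new index
                    } else {
                        rebuildCore.add(loadDocument(conn, id)); // reindex updates and additions
                    }
                }
                rebuildCore.commit();                            // then swap core1 into core0
            }
        }

        // Hypothetical helper: re-read the current row and map it to a Solr document.
        private static SolrInputDocument loadDocument(Connection conn, String id) throws SQLException {
            SolrInputDocument doc = new SolrInputDocument();
            try (PreparedStatement ps = conn.prepareStatement("select * from element where element_id = ?")) {
                ps.setString(1, id);
                try (ResultSet row = ps.executeQuery()) {
                    if (row.next()) {
                        doc.addField("id", row.getString("element_id"));
                        doc.addField("title", row.getString("title")); // hypothetical column
                    }
                }
            }
            return doc;
        }
    }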

Follow on

  • Is there a better approach?
  • Does Solr have any magic means to “meld” core0 into core1?

Thanks

user331465


1 Answer

There are a number of ways to skin this cat.... I am guessing that during the long indexing process of core1 (aka "on deck" core) you are running user queries against an already populated core0 (aka "live" core).

  1. If you can distinguish what has changed, why not just update the live core? If you can run queries against the live core and your filesystem of PDFs to find out which documents have been updated and which are deleted, just do it all against the live core, and ditch this offline process. This would be the simplest. Just put the update time of the PDF in your Solr document to detect which have changed. If the PDF doesn't exist in Solr, then add it. Keep a list of Solr document ids, and at the end, any that didn't have a matching PDF can be deleted. In the meantime you still have your real-time updates coming in.

  2. You could proxy the incoming live updates and multiplex them so they go to both core1 and core0. I've built a simple proxy interface and found it very simple. That way all your updates go to both of your cores and you don't have to do any "reconciliation" (a minimal sketch of such a multiplexing client follows this list).

  3. Lastly, you can merge two cores: http://wiki.apache.org/solr/MergingSolrIndexes#Merging_Through_CoreAdmin I don't really know what happens though if you have two documents with the same id, or if a document doesn't exist in one core, but does in the other... I assume it's all an additive process, but you'd want to dig into this.
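
A minimal sketch of option 2's multiplexing idea, assuming SolrJ and two HTTP cores; the URLs and this thin wrapper are illustrative assumptions, not the answerer's actual proxy:

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Thin wrapper that forwards every update to both the live core and the rebuild core,
    // so neither index misses changes made while the rebuild is running.
    public class DualCoreUpdater {
        private final SolrClient liveCore =
            new HttpSolrClient.Builder("http://localhost:8983/solr/core0").build();
        private final SolrClient rebuildCore =
            new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();

        public void add(SolrInputDocument doc) throws SolrServerException, IOException {
            liveCore.add(doc);
            rebuildCore.add(doc);
        }

        public void deleteById(String id) throws SolrServerException, IOException {
            liveCore.deleteById(id);
            rebuildCore.deleteById(id);
        }

        public void commit() throws SolrServerException, IOException {
            liveCore.commit();
            rebuildCore.commit();
        }
    }

The application would call this wrapper instead of a single SolrClient while the rebuild is running, which is what removes the need for a separate reconciliation pass.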

Love to hear how this goes!

Eric Pugh