Solr approaches to re-indexing large document corpus

We are looking for recommendations for systematically re-indexing an ever-growing corpus of documents in Solr (tens of millions now, hundreds of millions in less than a year) without taking the currently running index down. Re-indexing is needed on a periodic basis because:

  • New features are introduced around searching the existing corpus that require additional schema fields, which we can't always anticipate in advance.
  • The corpus is indexed across multiple shards. When it grows past a certain threshold, we need to create more shards and re-balance documents evenly across all of them (which SolrCloud does not yet seem to support).

The current index receives very frequent updates and additions, which need to be available for search within minutes. Approaches where the corpus is re-indexed in batch offline therefore don't really work: by the time the batch finishes, new documents will have been made available.

The approaches we are looking into at the moment are:

  • Create a new cluster of shards and batch re-index there while the old cluster is still available for searching. New documents that are not part of the re-indexed batch are sent to both the old cluster and the new cluster. When ready to switch, point the load balancer at the new cluster.
  • Use CoreAdmin: spawn a new core per shard and send the re-indexed batch to the new cores. New documents that are not part of the re-indexed batch are sent to both the old cores and the new cores. When ready to switch, use CoreAdmin to dynamically swap cores (see the sketch after this list).
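
For either approach, the dual-write step and the final cut-over can be driven through Solr's plain HTTP APIs. Below is a minimal Python sketch, assuming Solr's JSON update endpoint and the CoreAdmin SWAP action; the host names, core names, and document fields are hypothetical placeholders.

    import requests

    OLD = "http://solr-old:8983/solr/collection1"  # hypothetical current core
    NEW = "http://solr-new:8983/solr/collection1"  # hypothetical re-indexed core

    def index(base_url, docs, commit=False):
        """Post a batch of documents to Solr's JSON update endpoint."""
        resp = requests.post(
            f"{base_url}/update",
            params={"commit": "true"} if commit else {},
            json=docs,
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    # While the batch re-index fills the new cores, dual-write live updates
    # so neither side falls behind:
    live_docs = [{"id": "doc-42", "title_t": "a freshly updated document"}]
    for target in (OLD, NEW):
        index(target, live_docs, commit=True)

    # When the new cores have caught up, swap them in via CoreAdmin
    # (approach 2); approach 1 would instead repoint the load balancer.
    requests.get(
        "http://solr-old:8983/solr/admin/cores",
        params={"action": "SWAP", "core": "collection1",
                "other": "collection1_reindexed"},
    ).raise_for_status()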

We'd appreciate it if folks could either confirm or poke holes in any of these approaches. Is one more appropriate than the other? Or are we completely off base? Thank you in advance.

asked Nov 04 '22 by gstathis

1 Answer

This may not be applicable to you guys, but I'll offer my approach to this problem.

Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.

With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. This, to me, just seemed to complicate things too much. It would make backups and restores more difficult, and you lose out on certain features when performing distributed searches.

The approach we ended up going with was a very simple clustered master/slave setup.

Each cluster consists of a master database and two Solr slaves that are load balanced. All new data is written to the master, and the slaves are configured to sync new data every 5 minutes. Under normal circumstances this is a very nice setup: re-indexing operations occur on the master, and while this is happening the slaves can still be read from.
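
For reference, the 5-minute sync in this kind of setup is typically configured on each slave through Solr's ReplicationHandler. A minimal sketch of the slave-side solrconfig.xml, assuming a master reachable at solr-master and a core named core1 (both placeholders):

    <!-- slave-side solrconfig.xml: poll the master for index changes every 5 minutes -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://solr-master:8983/solr/core1/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>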

When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. The customer-facing Solr database is now frozen (not syncing with the master) while the offline one is being updated. Once the re-index is complete and the offline slave database is in sync, I add it back to the load balancer, remove the other slave database from the load balancer, and re-configure it to sync with the master.
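
The rotation described above lends itself to a small script. Here's a hedged Python sketch using the ReplicationHandler's disablepoll/enablepoll commands; the slave URLs are placeholders, and add_to_lb/remove_from_lb are hypothetical callbacks standing in for whatever API the load balancer exposes:

    import requests

    def replication(slave_url, command):
        """Send a command (e.g. 'disablepoll') to a slave's ReplicationHandler."""
        requests.get(f"{slave_url}/replication",
                     params={"command": command}).raise_for_status()

    def rotate_for_reindex(offline, online, remove_from_lb, add_to_lb):
        # Take one slave out of rotation; it keeps polling and will pick up
        # the re-indexed master.
        remove_from_lb(offline)
        # Freeze the customer-facing slave so searches stay consistent while
        # the master is re-indexed.
        replication(online, "disablepoll")
        # ... run the re-index on the master, wait for `offline` to sync ...
        # Then swap roles: serve from the freshly synced slave and let the
        # stale one catch up again.
        add_to_lb(offline)
        remove_from_lb(online)
        replication(online, "enablepoll")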

So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.

Hope this helps!

answered Nov 15 '22 by Jason Palmer