We are looking for recommendations on systematically re-indexing an ever-growing corpus of documents in Solr (tens of millions now, hundreds of millions within a year) without taking the currently running index down. Re-indexing is needed on a periodic basis because:
The current index receives very frequent updates and additions, which need to be available for search within minutes. Approaches where the corpus is re-indexed in an offline batch therefore don't really work: by the time the batch finishes, new documents will already have arrived.
The approaches we are looking into at the moment are:
We'd appreciate it if folks could either confirm or poke holes in any or all of these approaches. Is one more appropriate than another? Or are we completely off base? Thank you in advance.
This may not be applicable to you guys, but I'll offer my approach to this problem.
Our Solr setup is currently a single core. We'll be adding more cores in the future, but the overwhelming majority of the data is written to a single core.
With this in mind, sharding wasn't really applicable to us. I looked into distributed searches - cutting up the data and having different slices of it running on different servers. To me, this just seemed to complicate things too much. It would make backups and restores more difficult, and you lose out on certain features when performing distributed searches.
The approach we ended up going with was a very simple clustered master/slave setup.
Each cluster consists of a Solr master and two Solr slaves that sit behind a load balancer. All new data is written to the master, and the slaves are configured to poll it for new data every 5 minutes. Under normal circumstances this is a very nice setup: re-indexing operations run on the master, and while they are happening the slaves can still be read from.
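For context, the 5-minute sync is just Solr's standard master/slave replication with a polling interval. A minimal sketch of the relevant solrconfig.xml sections is below; the host name and core name are placeholders you would adjust for your own setup:

```xml
<!-- On the master: publish a new index version after each commit/optimize -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- On each slave: poll the master every 5 minutes -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/core1/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```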
When a major re-indexing operation is happening, I remove one slave from the load balancer and turn off polling on the other. The customer-facing slave is now no longer syncing with the master, while the out-of-rotation slave keeps being updated. Once the re-index is complete and the offline slave is in sync, I add it back to the load balancer, remove the other slave from the load balancer, and re-enable polling so it syncs with the master again.
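To make the swap concrete, here is a rough sketch of that procedure as a script. The disablepoll/enablepoll commands are part of Solr's ReplicationHandler API; the host names, core name, and load-balancer helpers are placeholders you would replace with calls to your own infrastructure:

```python
import requests

# Placeholder URLs; adjust host and core names for your setup.
SLAVE_A = "http://solr-slave-a:8983/solr/core1"
SLAVE_B = "http://solr-slave-b:8983/solr/core1"

def set_polling(slave_url, enabled):
    """Enable or disable replication polling via Solr's ReplicationHandler."""
    command = "enablepoll" if enabled else "disablepoll"
    resp = requests.get(f"{slave_url}/replication", params={"command": command})
    resp.raise_for_status()

def remove_from_lb(slave_url):
    """Placeholder: take the node out of the load balancer (API varies)."""
    ...

def add_to_lb(slave_url):
    """Placeholder: put the node back into the load balancer (API varies)."""
    ...

# Before the big re-index on the master:
remove_from_lb(SLAVE_A)       # slave A will keep syncing in the background
set_polling(SLAVE_B, False)   # slave B keeps serving the old, stable index

# ... run the re-index on the master ...

# After the re-index, once slave A has pulled the new index:
add_to_lb(SLAVE_A)
remove_from_lb(SLAVE_B)
set_polling(SLAVE_B, True)    # slave B now syncs with the master and catches up
```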
So far this has worked very well. We currently have around 5 million documents in our database and this number will scale much higher across multiple clusters.
Hope this helps!