
Can one make Apache Solr index transactionally consistent with DB being indexed?

I am new to Solr. I am trying to build a server that stores structured data in a database and that can be searched using Solr/Lucene. The server can be clustered into any number of identical nodes for high availability.

It seems that in the standard configuration Solr stores the index in files on the local file system. This seems to introduce some problems with consistency and clustering.

How do I make the index transactionally consistent with the DB? Is there a way to do this? (e.g. some way to make commits to the DB coordinated with commits to the Solr index?)

Is there any way to store the index in the (relational) DB? This would solve the consistency and clustering problems, but I can't find much literature on how to do this.

When configured as a cluster, does each cluster node need to maintain its own copy of the index? It is not clear whether multiple instances of Solr can update a single index or not.

Or do we give up, accept that the index is not guaranteed to be consistent, and rebuild it every day or so? What do people normally do about this?

asked Oct 19 '12 02:10 by AgilePro




2 Answers

Q> How do I make the index transactionally consistent with the DB?
A> You can't. You could invent another transaction layer on top, but it would take ages to develop and you still wouldn't reach 100% consistency. You could, for example, send the data to both the DB and Solr and commit only after both have received it, but the two commits still would not be atomic.
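
Since the two commits can't be made atomic, one way to live with that is to commit to the DB first and record "this row still needs indexing" in the same DB transaction, then push to the index afterwards and retry on failure. This is not something the answer above prescribes; it is a minimal sketch of that idea (often called an outbox), assuming SQLite as a stand-in for the real database. The table names `items` and `index_outbox`, the functions, and the `send_to_index` callable (standing in for whatever Solr client you use) are all hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a data table and a pending-index table (both hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE index_outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, item_id TEXT)"
    )
    return conn


def save_item(conn, item_id, title):
    """Write the row and its 'needs indexing' marker in ONE database transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT OR REPLACE INTO items (id, title) VALUES (?, ?)", (item_id, title)
        )
        conn.execute("INSERT INTO index_outbox (item_id) VALUES (?)", (item_id,))


def drain_outbox(conn, send_to_index):
    """Push pending rows to the index; a marker is removed only after a successful push."""
    pending = conn.execute(
        "SELECT o.seq, i.id, i.title FROM index_outbox o "
        "JOIN items i ON i.id = o.item_id ORDER BY o.seq"
    ).fetchall()
    delivered = 0
    for seq, item_id, title in pending:
        try:
            send_to_index({"id": item_id, "title": title})
        except Exception:
            break  # index unreachable: leave the markers in place and retry later
        with conn:
            conn.execute("DELETE FROM index_outbox WHERE seq = ?", (seq,))
        delivered += 1
    return delivered
```

The index lags the DB until `drain_outbox` runs, so this gives eventual consistency, not transactional consistency. A document may be sent more than once if the process dies between the push and the delete; with Solr that is harmless as long as the schema has a uniqueKey, because re-adding a document with the same key overwrites the old one.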

Q> Is there any way to store the index in the (relational) DB?
A> With Lucene 4.0, you probably can (by writing your own codec). But this won't solve your problem.

Q> When configured as a cluster, does each cluster node need to maintain its own copy of the index?
A> Yes.

Q> It is not clear whether multiple instances of Solr can update a single index or not.
A> Multiple Lucene/Solr instances can't write to the same index file(s). The most you can do is open multiple IndexSearchers on the same index, but this is probably handled at the Solr level anyway.

Q> Do we give up and accept that the index is not guaranteed to be consistent?
A> Yes. I think you are too DB-centric. Think about Solr/Lucene the way you think about Google: I bet they don't roll out their entire index atomically throughout the world. If search results have minor inconsistencies depending on which server you hit (for a few seconds, of course), it's not a big deal.

Q> Rebuild it every day or so? What do people normally do about this?
A> Lucene has near-real-time search, but at the basic level you just send index updates and commit as DB changes happen, then reopen the index reader to see those updates. Solr does all of this automatically.
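
To make "send index updates and commit as DB changes happen" concrete, here is a rough sketch that builds an HTTP update request for Solr using only the Python standard library. It assumes a recent Solr version whose `/update` handler accepts a JSON array of documents and the `commitWithin` request parameter. The base URL, the collection name `items`, and the field names are placeholders, not anything from the question.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, collection, docs, commit_within_ms=1000):
    """Build (but do not send) a POST that adds or overwrites docs in a Solr collection."""
    # commitWithin asks Solr to make the change searchable within this many milliseconds,
    # instead of forcing an explicit commit on every update.
    query = urllib.parse.urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(collection)}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Send the request; raises urllib.error.URLError if Solr is unreachable."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

You would call something like `send(build_update_request("http://localhost:8983/solr", "items", [{"id": "1", "title": "hello"}]))` right after the DB commit succeeds. Letting Solr batch the commits through `commitWithin` is usually cheaper than committing on every single change.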

answered Oct 11 '22 06:10 by mindas


I know this is a bit old, but it might help someone. You can try SolrCloud with Apache ZooKeeper.

Out of the box, Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search, supporting the following features with little configuration:

Central configuration for the entire cluster
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration.

ZooKeeper acts as the cluster coordinator for Solr, and the two work really well together.
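
As a rough illustration of how little setup this takes, the sketch below builds the URL for SolrCloud's Collections API `CREATE` action, which asks the cluster to spread a collection over several shards and keep several replicas of each. The host, the collection name, and the shard and replica counts are placeholder assumptions.

```python
import urllib.parse


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build the Collections API URL that creates a sharded, replicated collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,  # how many pieces the index is split into
        "replicationFactor": replication_factor,  # how many copies of each piece exist
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urllib.parse.urlencode(params)}"


print(create_collection_url("http://localhost:8983/solr", "items", 2, 2))
```

Each replica still keeps its own copy of its shard's index on its own node. SolrCloud only takes over routing the updates and keeping those copies in sync.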

https://cwiki.apache.org/confluence/display/solr/SolrCloud

http://zookeeper.apache.org/doc/trunk/zookeeperOver.html

answered Oct 11 '22 06:10 by Victor Odiah