
Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue. After a restart of the whole cluster, one of the shards stopped being able to index/store documents. We had no hint of the problem until we started indexing (querying the server looked fine). The error is:

2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader     (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
  at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
  at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
  at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
  at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
  at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
  at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)

We run Solr 4.7 in cluster mode (5 shards) on Jetty. Each shard runs on a different host with one ZooKeeper server.

I checked the zookeeper log and I cannot see anything there.

The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.

  45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
  74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
  74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx

I'm not even sure whether this is relevant. (Can it be?) Any clue what other checks we can do?

giorgio asked May 19 '14 17:05



1 Answer

We've experienced this error under 2 conditions.

Condition 1

On a single ZooKeeper host there was an orphaned ZooKeeper ephemeral node in /overseer_elect/election. The session this ephemeral node was associated with no longer existed.

The orphaned ephemeral node cannot be deleted. This is caused by https://issues.apache.org/jira/browse/ZOOKEEPER-2355

This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.

To resolve the issue, you must restart the ZooKeeper node that holds the orphaned ephemeral node.

If, after the restart, you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
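As a rough way to spot candidates for an orphaned election node, the sketch below groups the children of /overseer_elect/election by Solr node name and flags any node name that appears under more than one ZooKeeper session ID. It assumes the child names follow the `<session id>-<node name>-n_<sequence>` pattern seen in the question; the function names and the sample values are made up for illustration. A flagged entry is only a hint: you would still need to confirm that the session is really gone, for example by comparing against the sessions ZooKeeper reports.

```python
from collections import defaultdict


def parse_election_node(name):
    """Split '<session id>-<node name>-n_<sequence>' into its three parts.

    The naming pattern is an assumption based on the entries in the question.
    """
    head, sep, sequence = name.rpartition("-n_")
    if not sep:
        raise ValueError("missing '-n_<sequence>' suffix: %r" % name)
    session_id, sep, node_name = head.partition("-")
    if not sep:
        raise ValueError("missing '<session id>-' prefix: %r" % name)
    return session_id, node_name, sequence


def sessions_per_node(children):
    """Map each Solr node name to the set of session IDs registered for it."""
    sessions = defaultdict(set)
    for child in children:
        session_id, node_name, _ = parse_election_node(child)
        sessions[node_name].add(session_id)
    return dict(sessions)


def suspicious_nodes(children):
    """Return node names registered under more than one session ID."""
    grouped = sessions_per_node(children)
    return sorted(node for node, ids in grouped.items() if len(ids) > 1)


# Hypothetical listing of /overseer_elect/election (values are invented).
example_children = [
    "45654861041276432-10.0.0.1:7070_solr-n_0000000301",
    "74030267031685368-10.0.0.1:7070_solr-n_0000000305",
    "74030267031685369-10.0.0.2:7070_solr-n_0000000306",
]

if __name__ == "__main__":
    for node in suspicious_nodes(example_children):
        print("more than one session registered for", node)
```

The child list itself has to come from ZooKeeper (for example by listing the path with a ZooKeeper client); the sketch deliberately works on plain strings so it does not depend on any particular client library.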

Condition 2

Cause: a misconfigured systemd service unit. If you are using systemd, make sure the unit has Type=forking and a correctly configured PIDFile.

systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point two services were started. Because the second instance cannot start (both can't listen on the same port), it seems to either sit there hanging in a failed state, or fail to start the process while still disrupting the other Solr process, possibly by overwriting temporary clusterstate files locally.
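A quick sanity check of the unit file can catch this before it causes duplicate starts. The following is a minimal sketch, assuming the unit file can be read as INI-style text; the sample unit, its paths, and the function name are hypothetical and not taken from any real installation. It only checks the two settings mentioned above and does not verify that the PID file path matches what Solr actually writes.

```python
import configparser

# Hypothetical unit file contents; paths and port are placeholders.
SAMPLE_UNIT = """\
[Unit]
Description=Apache Solr (hypothetical example unit)
After=network.target

[Service]
Type=forking
PIDFile=/var/solr/solr-8983.pid
ExecStart=/opt/solr/bin/solr start -p 8983
ExecStop=/opt/solr/bin/solr stop -p 8983
User=solr
"""


def check_forking_unit(unit_text):
    """Return a list of problems found in the [Service] section."""
    # strict=False tolerates repeated keys, interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case as written (systemd keys are case-sensitive)
    parser.read_string(unit_text)

    if not parser.has_section("Service"):
        return ["no [Service] section"]

    service = parser["Service"]
    problems = []
    if service.get("Type") != "forking":
        problems.append("Type is %r, expected 'forking'" % service.get("Type"))
    if not service.get("PIDFile"):
        problems.append("PIDFile is not set")
    return problems


if __name__ == "__main__":
    issues = check_forking_unit(SAMPLE_UNIT)
    print("unit looks fine" if not issues else "\n".join(issues))
```

To check a real unit, read its text from wherever your distribution stores it and pass that string to `check_forking_unit`.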

Solr logs reported the same error the OP posted.

Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this ZooKeeper node contains contents such as:

{"core":"collection-name_shard1_replica1", "core_node_name":"core_node7", "base_url":"http://10.10.10.21:8983/solr", "node_name":"10.10.10.21:8983_solr"}

But the node was completely missing on the cluster where duplicate Solr instances were attempting to start.
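If you fetch the contents of that leader node yourself, a small helper can tell a missing leader apart from a leader that points at a different instance. This is only an illustrative sketch: the function names are invented, the sample data is the example shown above, and how you obtain the raw node data (and what "local" base URL and core to compare against) is left to you.

```python
import json

# Contents of the leader node as shown above.
SAMPLE_LEADER = (
    b'{"core":"collection-name_shard1_replica1", "core_node_name":"core_node7", '
    b'"base_url":"http://10.10.10.21:8983/solr", "node_name":"10.10.10.21:8983_solr"}'
)


def read_leader(znode_data):
    """Return the leader properties as a dict, or None if no leader is registered.

    Pass None (or empty data) when the leader node does not exist.
    """
    if not znode_data:
        return None
    return json.loads(znode_data)


def leader_matches(znode_data, local_base_url, local_core):
    """True if the registered leader is the given local core on the given base URL."""
    leader = read_leader(znode_data)
    if leader is None:
        return False
    return (leader.get("base_url") == local_base_url
            and leader.get("core") == local_core)


if __name__ == "__main__":
    leader = read_leader(SAMPLE_LEADER)
    print("registered leader:", leader["node_name"] if leader else "none")
```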

This error also appeared in the Solr Logs: HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json

To correct the issue, kill all instances of Solr (or all java processes, if you know it's safe), then restart the Solr service.

Ben DeMott answered Sep 28 '22 11:09
