
Random disconnects from master node NoNodeAvailableException using Elastic Cloud/Found

I'm using Elastic Cloud (formerly Found) with Shield and the Java transport client. The app communicating with ES runs on Heroku. I'm running a stress test on a staging environment with a single node:

{
    "cluster_name": ...,
    "status": "yellow", 
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 19,
    "active_shards": 19,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 7,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0
}

At the beginning everything works perfectly, but after some time (3-4 minutes) I begin to get errors. I've set the log level to trace, and these are the errors I've been getting (I've replaced everything irrelevant with ...):

org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available: [[...][...][...][inet[...]]{logical_availability_zone=..., availability_zone=..., max_local_storage_nodes=1, region=..., master=true}]
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:242)
    at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:78)
    at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [...][inet[...]][indices:data/read/search]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.shield.transport.ShieldClientTransportService.sendRequest(ShieldClientTransportService.java:41)
    at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
    at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
    at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205)
    at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
    at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:416)
    at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1122)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
    ...
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [...][inet[...]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
    ... 

These are my properties

  settings = ImmutableSettings.settingsBuilder()
      .put("client.transport.nodes_sampler_interval", "5s") //Tried it with 30s, same outcome
      .put("client.transport.ping_timeout", "30s")
      .put("cluster.name", clusterName)
      .put("action.bulk.compress", false)
      .put("shield.transport.ssl", true)
      .put("request.headers.X-Found-Cluster", clusterName)
      .put("shield.user", user + ":" + password)
      .put("transport.ping_schedule", "1s") //Tried with 5s, same outcome
      .build();

I've also set for every query I make:

max_query_response_size=100000
timeout_seconds=30

I'm using Elasticsearch 1.7.2 and Shield 1.3.2 with clients of the same versions, with Java 1.8.0_65 on my machine and Java 1.8.0_40 on the node.

I was getting the same errors without a stress test, but they occurred at random, so I wanted a way to reproduce them. That's why I'm running this on a single node.

I spotted another error in my logs:

2016-03-07 23:35:52,177 DEBUG [elasticsearch[Vermin][transport_client_worker][T#7]{New I/O worker #16}] ssl.SslHandler (NettyInternalESLogger.java:debug(63)) - Swallowing an exception raised while writing non-app data
java.nio.channels.ClosedChannelException
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Hot threads

0.0% (111.6micros out of 500ms) cpu usage by thread 'elasticsearch[...][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:745)

After reading this http://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/ I came to understand a little better how the whole communication works. I haven't tested this yet, but I believe the problem lies there. Still, even if I confirm that the problem is closed query connections, how do I handle it? Do I keep the config as is and just reconnect? Do I disable keepAlive? If so, is there anything else I should worry about?

Asked by Alkis Kalogeris, Mar 07 '16 21:03



1 Answer

Quoting Konrad Beiske from https://discuss.elastic.co/t/nonodeavailableexception-with-java-transport-client/37702:

Your application could be resolving the IP address at boot time. The ELB can change IPs at any point in time. For the best reliability, your application should add all IPs of the ELB to the client and periodically check the DNS service for changes.

The connection timeout of our ELBs is 5 minutes.

The following should help you fix it:

Creating a new TransportClient for every request is not ideal as it will imply a new connection handshake for every request and this will hurt your response time. You could have a pool of TransportClients if you prefer, but it will most likely be an unnecessary overhead as the client is thread safe.

My suggestion is that you create a small singleton service that periodically checks the DNS service for changes and adds any new IPs to your existing transport client. In theory it could be as naive as adding every IP it discovers on each check, as the transport client will discard duplicate addresses and also purge old addresses that are no longer reachable.
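The singleton service suggested above could be sketched roughly as follows. This is only an illustration under assumptions, not code from the linked thread: `DnsRefresher`, its `Resolver` interface and the `sink` callback are hypothetical names, and the sketch deliberately uses only the JDK so that the DNS lookup and the "add address to the client" step can be swapped in or stubbed out.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: periodically re-resolves a host and forwards every address to a sink. */
final class DnsRefresher implements AutoCloseable {

    /** Hypothetical abstraction over the DNS lookup so it can be replaced in tests. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final int port;
    private final Resolver resolver;
    private final Consumer<InetSocketAddress> sink;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(task -> {
                Thread t = new Thread(task, "dns-refresher");
                t.setDaemon(true); // do not keep the JVM alive just for this task
                return t;
            });

    DnsRefresher(String host, int port, Resolver resolver, Consumer<InetSocketAddress> sink) {
        this.host = host;
        this.port = port;
        this.resolver = resolver;
        this.sink = sink;
    }

    /** One pass: resolve the host and hand every address to the sink. Returns how many were forwarded. */
    int refreshOnce() {
        try {
            InetAddress[] addresses = resolver.resolve(host);
            for (InetAddress address : addresses) {
                sink.accept(new InetSocketAddress(address, port));
            }
            return addresses.length;
        } catch (UnknownHostException | RuntimeException e) {
            // A failed lookup or a failing sink must not cancel the schedule;
            // the client simply keeps whatever addresses it already has.
            return 0;
        }
    }

    /** Runs refreshOnce immediately and then again after every period. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With the 1.x transport client from the question, the resolver would presumably be `InetAddress::getAllByName` and the sink something like `address -> client.addTransportAddress(new InetSocketTransportAddress(address))`; treat that wiring as an assumption to check against your client version. Also keep in mind that the JVM caches DNS lookups (the `networkaddress.cache.ttl` security property), so a refresh period shorter than that cache lifetime will not see changes any sooner.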

Answered by Archit Saxena, Oct 18 '22 13:10
