
Storm supervisor connectivity error downloading the jar from nimbus

I am setting up a multi-node Storm cluster with 3 ZooKeeper nodes, 1 Nimbus, 2 supervisors, and 1 Storm client node. The connections between ZooKeeper and Nimbus and between ZooKeeper and the supervisors all seem to be fine. But when a supervisor tries to pull the jar file down from the Nimbus data directory, it gets a "Connection refused". Out of frustration, I have even opened up all TCP and UDP ports (0-65535) between the boxes, but I still end up with connection refused.

I have verified that the permissions on the nimbus data directory are pretty open, and the supervisor should be able to get to the directory and pull the file down fine. Here are the logs.

Nimbus.log:

2014-11-23 07:07:50 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
2014-11-23 07:07:50 o.a.z.ClientCnxn [INFO] EventThread shut down
2014-11-23 07:07:50 o.a.z.ZooKeeper [INFO] Session: 0x249d964a3c20008 closed
2014-11-23 07:07:50 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-11-23 07:07:50 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=172.31.40.214:2181,172.31.45.110:2181,172.31.47.13:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@40160f3d
2014-11-23 07:07:50 o.a.z.ClientCnxn [INFO] Opening socket connection to server /172.31.40.214:2181
2014-11-23 07:07:50 o.a.z.ClientCnxn [INFO] Socket connection established to ip-172-31-40-214.us-west-2.compute.internal/172.31.40.214:2181, initiating session
2014-11-23 07:07:50 o.a.z.ClientCnxn [INFO] Session establishment complete on server ip-172-31-40-214.us-west-2.compute.internal/172.31.40.214:2181, sessionid = 0x149d964a86c001d, negotiated timeout = 20000
2014-11-23 07:07:50 b.s.d.nimbus [INFO] Delaying event :remove for 30 secs for TestingStormClusterTopology-1-1416724578
2014-11-23 07:07:50 b.s.d.nimbus [INFO] Starting Nimbus server...

2014-11-23 07:08:20 b.s.d.nimbus [INFO] Killing topology: TestingStormClusterTopology-1-1416724578
2014-11-23 07:08:22 b.s.d.nimbus [INFO] Cleaning up TestingStormClusterTopology-1-1416724578

2014-11-23 07:09:39 b.s.d.nimbus [INFO] Uploading file from client to /home/ubuntu/data/storm/nimbus/inbox/stormjar-dc265069-ebde-482f-abee-ccb7915fa663.jar
2014-11-23 07:09:39 b.s.d.nimbus [INFO] Finished uploading file from client: /home/ubuntu/data/storm/nimbus/inbox/stormjar-dc265069-ebde-482f-abee-ccb7915fa663.jar
2014-11-23 07:09:39 b.s.d.nimbus [INFO] Received topology submission for TestingStormClusterTopology with conf {"topology.max.task.parallelism" nil, "topology.acker.executors" nil, "topology.kryo.register" nil, "topology.kryo.decorators" (), "topology.name" "TestingStormClusterTopology", "storm.id" "TestingStormClusterTopology-1-1416726579", "topology.workers" 3}
2014-11-23 07:09:39 b.s.d.nimbus [INFO] Activating TestingStormClusterTopology: TestingStormClusterTopology-1-1416726579
2014-11-23 07:09:39 b.s.s.EvenScheduler [INFO] Available slots: (["30d36d53-ee60-4667-8a37-44c674da23e7" 6703] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6702] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6701] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6700])
2014-11-23 07:09:39 b.s.d.nimbus [INFO] Setting new assignment for topology id TestingStormClusterTopology-1-1416726579: #backtype.storm.daemon.common.Assignment{:master-code-dir "/home/ubuntu/data/storm/nimbus/stormdist/TestingStormClusterTopology-1-1416726579", :node->host {"30d36d53-ee60-4667-8a37-44c674da23e7" "ip-172-31-43-254.us-west-2.compute.internal"}, :executor->node+port {[2 2] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6702], [3 3] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6701], [4 4] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6703], [5 5] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6702], [6 6] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6701], [7 7] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6703], [8 8] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6702], [9 9] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6701], [1 1] ["30d36d53-ee60-4667-8a37-44c674da23e7" 6703]}, :executor->start-time-secs {[1 1] 1416726579, [9 9] 1416726579, [8 8] 1416726579, [7 7] 1416726579, [6 6] 1416726579, [5 5] 1416726579, [4 4] 1416726579, [3 3] 1416726579, [2 2] 1416726579}}
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[2 2] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[3 3] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[4 4] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[5 5] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[6 6] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[7 7] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[8 8] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[9 9] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[1 1] not alive
2014-11-23 07:11:42 b.s.d.nimbus [INFO] Setting new assignment for topology id TestingStormClusterTopology-1-1416726579: #backtype.storm.daemon.common.Assignment{:master-code-dir "/home/ubuntu/data/storm/nimbus/stormdist/TestingStormClusterTopology-1-1416726579", :node->host {}, :executor->node+port {}, :executor->start-time-secs {[1 1] 1416726579, [9 9] 1416726579, [8 8] 1416726579, [7 7] 1416726579, [6 6] 1416726579, [5 5] 1416726579, [4 4] 1416726579, [3 3] 1416726579, [2 2] 1416726579}}
2014-11-23 07:11:52 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[2 2] not alive
2014-11-23 07:11:52 b.s.d.nimbus [INFO] Executor TestingStormClusterTopology-1-1416726579:[3 3] not alive

And here is the supervisor.log file.

Supervisor.log

2014-11-23 07:08:55 b.s.d.supervisor [INFO] Starting Supervisor with conf {"dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "storm.messaging.netty.client_worker_threads" 1, "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "nimbus.monitor.freq.secs" 10, "logviewer.childopts" "-Xmx128m", "java.library.path" "/opt/jdk", "topology.executor.send.buffer.size" 1024, "storm.local.dir" "/home/ubuntu/data/storm", "storm.messaging.netty.buffer_size" 5242880, "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "localhost", "storm.messaging.netty.min_wait_ms" 100, "storm.zookeeper.port" 2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 1024, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "storm.zookeeper.retry.intervalceiling.millis" 30000, "supervisor.enable" true, "storm.messaging.netty.server_worker_threads" 1, "storm.zookeeper.servers" ["172.31.40.214" "172.31.45.110" "172.31.47.13"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" nil, "topology.transfer.buffer.size" 1024, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx768m", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "nimbus.host.ip" "172.31.47.40", "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "drpc.childopts" "-Xmx768m", "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, 
"topology.tasks" nil, "storm.messaging.netty.max_retries" 30, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "topology.sleep.spout.wait.strategy.time.ms" 1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703], "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.message.timeout.secs" 30, "task.refresh.poll.secs" 10, "topology.workers" 1, "supervisor.childopts" "-Xmx256m", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "logviewer.port" 8000, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "storm.thrift.transport" "backtype.storm.security.auth.SimpleTransportPlugin", "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "storm.messaging.transport" "backtype.storm.messaging.netty.Context", "logviewer.appender.name" "A1", "storm.messaging.netty.max_wait_ms" 1000, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8080, "nimbus.childopts" "-Xmx1024m", "storm.cluster.mode" "distributed", "topology.optimize" true, "topology.max.task.parallelism" nil}
2014-11-23 07:08:56 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-11-23 07:08:56 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=172.31.40.214:2181,172.31.45.110:2181,172.31.47.13:2181 sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@76a78717
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Opening socket connection to server /172.31.47.13:2181
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Socket connection established to ip-172-31-47-13.us-west-2.compute.internal/172.31.47.13:2181, initiating session
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Session establishment complete on server ip-172-31-47-13.us-west-2.compute.internal/172.31.47.13:2181, sessionid = 0x349d964c0d30018, negotiated timeout = 20000
2014-11-23 07:08:56 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] EventThread shut down
2014-11-23 07:08:56 o.a.z.ZooKeeper [INFO] Session: 0x349d964c0d30018 closed
2014-11-23 07:08:56 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2014-11-23 07:08:56 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=172.31.40.214:2181,172.31.45.110:2181,172.31.47.13:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@603043f6
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Opening socket connection to server /172.31.40.214:2181
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Socket connection established to ip-172-31-40-214.us-west-2.compute.internal/172.31.40.214:2181, initiating session
2014-11-23 07:08:56 o.a.z.ClientCnxn [INFO] Session establishment complete on server ip-172-31-40-214.us-west-2.compute.internal/172.31.40.214:2181, sessionid = 0x149d964a86c001f, negotiated timeout = 20000
2014-11-23 07:08:56 b.s.d.supervisor [INFO] Starting supervisor with id 30d36d53-ee60-4667-8a37-44c674da23e7 at host ip-172-31-43-254.us-west-2.compute.internal


2014-11-23 07:09:39 b.s.d.supervisor [INFO] Downloading code for storm id TestingStormClusterTopology-1-1416726579 from /home/ubuntu/data/storm/nimbus/stormdist/TestingStormClusterTopology-1-1416726579
2014-11-23 07:09:39 b.s.event [ERROR] Error when processing event
java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
    at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:21) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.Utils.downloadFromMaster(Utils.java:226) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.daemon.supervisor$fn__6326.invoke(supervisor.clj:396) ~[storm-core-0.9.0.1.jar:na]
    at clojure.lang.MultiFn.invoke(MultiFn.java:172) ~[clojure-1.4.0.jar:na]
    at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__6251.invoke(supervisor.clj:290) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.event$event_manager$fn__3072.invoke(event.clj:24) ~[storm-core-0.9.0.1.jar:na]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Caused by: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift7.transport.TSocket.open(TSocket.java:183) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    at backtype.storm.security.auth.SimpleTransportPlugin.connect(SimpleTransportPlugin.java:66) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.security.auth.ThriftClient.<init>(ThriftClient.java:46) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:30) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.NimbusClient.<init>(NimbusClient.java:26) ~[storm-core-0.9.0.1.jar:na]
    at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:19) ~[storm-core-0.9.0.1.jar:na]
    ... 7 common frames omitted
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.7.0_65]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[na:1.7.0_65]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) ~[na:1.7.0_65]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[na:1.7.0_65]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.7.0_65]
    at java.net.Socket.connect(Socket.java:579) ~[na:1.7.0_65]
    at org.apache.thrift7.transport.TSocket.open(TSocket.java:178) ~[libthrift7-0.7.0-2.jar:0.7.0-2]
    ... 13 common frames omitted
2014-11-23 07:09:39 b.s.util [INFO] Halting process: ("Error when processing an event")

So I am trying to understand whether I have to share public and/or private keys across these boxes. I know how to generate a public/private key pair (ssh-keygen), but I am unsure what the strategy should be for sharing the keys across boxes.

I am not even sure that is the problem; I am just confused about what the connection refused error could mean. Apologies for the long post, but I wanted to provide as much information as I can.

macha asked Nov 09 '22 22:11


1 Answer

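The "Connection refused" comes from the supervisor's Thrift client: it could not open a connection to Nimbus on the host and port it was configured with, which means nothing was listening at that address. It is unlikely to be related to SSH keys or to the permissions on the Nimbus data directory, since the stack trace shows the download going through NimbusClient over Thrift. In your supervisor log the conf has "nimbus.host" "localhost" and "nimbus.thrift.port" 6627, so the supervisor is most likely trying to reach Nimbus on its own box. The Nimbus address appears under "nimbus.host.ip" instead, which is probably not the key Storm reads; try setting nimbus.host to the Nimbus machine's address in storm.yaml on the supervisors and restarting them.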
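A quick way to confirm which Nimbus endpoint a supervisor is using, and whether anything is listening there, is to read it out of the "Starting Supervisor with conf" line and attempt a plain TCP connection. The following is a minimal Python sketch, not part of Storm: the function names `nimbus_endpoint` and `can_connect` are hypothetical, and it assumes the conf dump has the same `"key" value` layout as the supervisor log in the question.

```python
import re
import socket
from typing import Tuple


def nimbus_endpoint(conf_dump: str) -> Tuple[str, int]:
    """Extract nimbus.host and nimbus.thrift.port from a supervisor conf dump.

    Assumes the '"key" value' layout printed in the supervisor log.
    The closing quote right after 'host' keeps "nimbus.host.ip" from matching.
    """
    host = re.search(r'"nimbus\.host"\s+"([^"]+)"', conf_dump)
    port = re.search(r'"nimbus\.thrift\.port"\s+(\d+)', conf_dump)
    if host is None or port is None:
        raise ValueError("nimbus.host or nimbus.thrift.port not found in conf")
    return host.group(1), int(port.group(1))


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Shortened excerpt of the conf dump from the supervisor log above.
    conf = ('{"nimbus.host" "localhost", "nimbus.host.ip" "172.31.47.40", '
            '"nimbus.thrift.port" 6627}')
    host, port = nimbus_endpoint(conf)
    print("supervisor will contact Nimbus at %s:%d" % (host, port))
    print("reachable:", can_connect(host, port))
```

Run on the supervisor box, this reports the address the supervisor will actually try. If it says `localhost` and the connection fails, the supervisor is looking for Nimbus on its own machine.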

Kshitij Kulshrestha answered Nov 15 '22 11:11