 

Storm worker died after running for 2 hours

Tags:

apache-storm

I wrote a topology that reads a topic from Kafka, does some aggregation, and then stores the results in a database. The topology was running fine for several hours, but then the worker died and eventually the supervisor died as well. This happens every time after the topology has been running for several hours.

I'm running Storm 0.9.5 on 3 nodes (1 for nimbus, 2 for workers).

This is the error I got in one of the workers' logs:

2015-08-12T04:10:38.395+0000 b.s.m.n.Client [ERROR] connection attempt 101 to Netty-Client-/10.28.18.213:6700 failed: java.lang.RuntimeException: Returned channel was actually not established
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] closing Netty Client Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.395+0000 b.s.m.n.Client [INFO] waiting up to 600000 ms to send 0 pending messages to Netty-Client-/10.28.18.213:6700
2015-08-12T04:10:38.404+0000 STDIO [ERROR] Aug 12, 2015 4:10:38 AM org.apache.storm.guava.util.concurrent.ExecutionList executeListener
SEVERE: RuntimeException while executing runnable org.apache.storm.guava.util.concurrent.Futures$4@632ef20f with executor org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService@1f15e9a8
java.lang.RuntimeException: Failed to connect to Netty-Client-/10.28.18.213:6700
        at backtype.storm.messaging.netty.Client.connect(Client.java:308)
        at backtype.storm.messaging.netty.Client.access$1100(Client.java:78)
        at backtype.storm.messaging.netty.Client$2.reconnectAgain(Client.java:297)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:283)
        at backtype.storm.messaging.netty.Client$2.onSuccess(Client.java:275)
        at org.apache.storm.guava.util.concurrent.Futures$4.run(Futures.java:1181)
        at org.apache.storm.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at org.apache.storm.guava.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
        at org.apache.storm.guava.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
        at org.apache.storm.guava.util.concurrent.ListenableFutureTask.done(ListenableFutureTask.java:91)
        at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:380)
        at java.util.concurrent.FutureTask.set(FutureTask.java:229)
        at java.util.concurrent.FutureTask.run(FutureTask.java:270)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Giving up to connect to Netty-Client-/10.28.18.213:6700 after 102 failed attempts
        at backtype.storm.messaging.netty.Client.connect(Client.java:303)
        ... 19 more

And this is my configuration for each worker node:

storm.zookeeper.servers:
- "10.28.19.230"
- "10.28.19.224"
- "10.28.19.223"
storm.zookeeper.port: 2181
nimbus.host: "10.28.18.211"
storm.local.dir: "/mnt/storm/storm-data"
storm.local.hostname: "10.28.18.213"
storm.messaging.transport: backtype.storm.messaging.netty.Context
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 300
storm.messaging.netty.max_wait_ms: 4000
storm.messaging.netty.min_wait_ms: 100
supervisor.slots.ports:
- 6700

supervisor.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346
#worker.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=1%ID%"
#supervisor.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=12346"
worker.childopts: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID% -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx10240m -Xms10240m -XX:MaxNewSize=6144m

1 Answer

I think I found the root cause based on this thread: https://mail-archives.apache.org/mod_mbox/storm-user/201402.mbox/%[email protected]%3E

The Netty client errors are only symptoms; "the root cause was that GC logging for workers had been turned on without specifying an output file. As a result the GC logging went to standard output without being redirected to logback. Eventually the buffer filled and the JVM would hang (and thus stop heartbeating) and be killed. The amount of time the worker would last depended on memory pressure and allocated heap size (obviously, in hindsight the more GC, the more GC logging, the faster the buffer fills)."
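
That matches my config above: worker.childopts enables -verbose:gc and the -XX:+PrintGC* flags without an -Xloggc: target, so all GC output goes to the worker's stdout. A minimal sketch of the fix, if you want to keep GC logging, is to point it at a per-worker file (the log path below is just a placeholder; %ID% is substituted with the worker port, as in the JMX flag already in my config):

worker.childopts: "-verbose:gc -Xloggc:/mnt/storm/logs/gc-worker-%ID%.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=2%ID% -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx10240m -Xms10240m -XX:MaxNewSize=6144m"

The other option is simply to drop -verbose:gc and the -XX:+PrintGC* flags from worker.childopts (and supervisor.childopts) so nothing is written to stdout at all.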

Hope this can help you guys.
