When running in a cluster, if something goes wrong, a worker generally dies (JVM shutdown). Many factors can cause this, and most of the time it is a challenge (the biggest difficulty with Storm?) to find out what caused the crash.
Of course, the Storm supervisor restarts dead workers, and liveness is quite good within a Storm cluster. Still, a worker crash is a mess that we should avoid: it adds overhead and latency (it can take a very long time until a worker is found dead and respawned), and it causes data loss if you didn't design your topology to prevent that.
Is there an easy way / tool / methodology to check when, and possibly why, a Storm worker crashes? Worker crashes are not shown in the Storm UI (whereas supervisors are shown), and everything needs careful manual monitoring (with jstack and JVM options, for instance).
Here are some cases that can happen:
The Storm supervisor logs worker restarts that are caused by a timeout, so you can monitor the supervisor log. You can also monitor the performance of your bolt's execute(tuple) method.
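One way to watch the performance of execute(tuple) is to time each call and count the slow ones. The class below is only a rough sketch: `ExecuteLatencyTracker`, its methods, and the warning threshold are hypothetical names of my own, not part of Storm, and it uses nothing but the JDK.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not a Storm API): records how long each call takes
 * and counts the calls that exceed a warning threshold.
 */
final class ExecuteLatencyTracker {
    private final long warnThresholdNanos;
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong slowCalls = new AtomicLong();
    private final AtomicLong maxNanos = new AtomicLong();

    ExecuteLatencyTracker(long warnThresholdMillis) {
        this.warnThresholdNanos = TimeUnit.MILLISECONDS.toNanos(warnThresholdMillis);
    }

    /** Runs the body and records its wall-clock duration, even if it throws. */
    void time(Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    /** Updates the counters for one call that took elapsedNanos. */
    void record(long elapsedNanos) {
        calls.incrementAndGet();
        maxNanos.accumulateAndGet(elapsedNanos, Math::max);
        if (elapsedNanos > warnThresholdNanos) {
            slowCalls.incrementAndGet();
            // Goes to the worker's stderr; swap in your own logger if you have one.
            System.err.printf("slow execute: %d ms%n",
                    TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
        }
    }

    long calls() { return calls.get(); }

    long slowCalls() { return slowCalls.get(); }

    long maxMillis() { return TimeUnit.NANOSECONDS.toMillis(maxNanos.get()); }
}
```

Inside a bolt you would keep one tracker per bolt instance and wrap the body of execute(tuple), something like `tracker.time(() -> process(tuple))`, where `process` stands for your own logic. A growing number of slow calls shortly before a restart suggests the worker is stalling rather than crashing outright.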
As for memory leaks: since the Storm supervisor kills the worker with kill -9, the heap dump is likely to be corrupted, so I would use tools that monitor your heap dynamically, or kill the supervisor first so that it cannot kill the worker while you produce a heap dump via jmap. Also, try monitoring the GC logs.
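If you have no external tool at hand, a crude form of dynamic heap monitoring can run inside the worker JVM itself, using the standard `java.lang.management` API. This is a sketch under assumptions: the `HeapWatch` class, the threshold, and the polling period are hypothetical choices of mine, and it only logs heap occupancy; it does not replace GC logs or a real profiler.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not a Storm API): logs when heap usage gets high. */
final class HeapWatch {

    /** Returns used/max as a fraction, or -1 if the maximum is undefined. */
    static double usedFraction(long usedBytes, long maxBytes) {
        return maxBytes <= 0 ? -1.0 : (double) usedBytes / maxBytes;
    }

    /** Reads the current heap usage of this JVM. */
    static double currentUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return usedFraction(heap.getUsed(), heap.getMax());
    }

    /** Starts a daemon thread that polls the heap and logs above the threshold. */
    static Thread start(double threshold, long periodMillis) {
        Thread watcher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                double fraction = currentUsedFraction();
                if (fraction >= threshold) {
                    System.err.printf("heap usage at %.0f%%%n", fraction * 100);
                }
                try {
                    Thread.sleep(periodMillis);
                } catch (InterruptedException e) {
                    // Restore the flag so the loop condition ends the thread.
                    Thread.currentThread().interrupt();
                }
            }
        }, "heap-watch");
        watcher.setDaemon(true); // must not keep the worker JVM alive
        watcher.start();
        return watcher;
    }
}
```

Calling something like `HeapWatch.start(0.9, 5000)` once per worker, for example from a bolt's prepare method, leaves a trail in the worker log. If usage climbs steadily right up to the moment the worker is killed, that points toward a leak even when no usable heap dump survives.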
I still recommend increasing the default timeouts.