I have a 4-node (master + 3 slaves) cluster running Hadoop 0.20.203.0. Every few days, one or more datanodes are reported as dead on the master. On the slave, everything appears fine: the datanode process is still running and there is nothing suspicious in the logs, although it is no longer receiving any requests. On the master, the logs show that the datanode heartbeat has been lost.
The only fix I have found is to manually stop the datanode and start it again. After several minutes the datanode is reported as live again.
Has anyone else experienced this? If so, what was the cause, and what was the solution?
We had a similar problem; for us the solution was to increase the open file limit.
Try adding a line such as ulimit -n 4096 to hadoop-env.sh.
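
For illustration, here is a slightly more defensive version of that line. This is only a sketch: it assumes hadoop-env.sh is sourced by the scripts that start the datanode, and the variable name TARGET_NOFILE is made up for this example. It raises the limit only when the current one is lower, and prints a warning instead of failing if the hard limit does not allow the increase.

```shell
# Sketch for conf/hadoop-env.sh (assumption: this file is sourced by the
# Hadoop start scripts, so the limit applies to daemons they launch).

# Hypothetical variable name; 4096 is the value suggested above.
TARGET_NOFILE=4096

# Current soft limit on open file descriptors for this shell.
current=$(ulimit -n)

if [ "$current" != "unlimited" ] && [ "$current" -lt "$TARGET_NOFILE" ]; then
    # Raising the limit fails if the hard limit is lower than the target;
    # warn on stderr rather than aborting daemon startup.
    ulimit -n "$TARGET_NOFILE" 2>/dev/null || \
        echo "warning: could not raise open-file limit above $current" >&2
fi
```

After restarting the datanode, you can check whether the new limit took effect on Linux by looking at the "Max open files" row in /proc/<pid>/limits, where <pid> is the datanode's process ID. If the warning appears, the hard limit for the user running Hadoop probably needs to be raised first (for example in /etc/security/limits.conf on many Linux distributions).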