
Hadoop nodes die (crash) after a while

I have a hadoop cluster of 16 (ubuntu 12.04 server) nodes (1 master and 15 slaves). They are connected through a private network and the master also has a public IP (it belongs to two networks). When I run small tasks, i.e., with small input and small processing time, everything works. However, when I run bigger tasks, i.e. with 7-8 GB of input data, my slave nodes start dying one after another.

From the web UI (http://master:50070/dfsnodelist.jsp?whatNodes=LIVE) I see that the "last contact" time starts increasing, and from my cluster provider's web UI I see that the nodes have crashed. Here is the screenshot of a node (I cannot scroll up):

[screenshot: crashed node's console output from the provider's web UI]

Another machine showed this error, with hadoop dfs running, while no job was running:

BUG: soft lockup - CPU#7 stuck for 27s! [java:4072]

and

BUG: soft lockup - CPU#5 stuck for 41s! [java:3309]
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
         res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata2.00: status: { DRDY }

Here is another screenshot (which I cannot make any sense of):

[screenshot: second node's console output]

Here is the log of a crashed datanode (with IP 192.168.0.9):

2014-02-01 15:17:34,874 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving blk_-2375077065158517857_1818 src: /192.168.0.7:53632 dest: /192.168.0.9:50010
2014-02-01 15:20:14,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for blk_-2375077065158517857_1818 java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-2375077065158517857_1818 0 : Thread is interrupted.
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for blk_-2375077065158517857_1818 terminating
2014-02-01 15:20:17,557 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-2375077065158517857_1818 received exception java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,560 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:296)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:340)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:582)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
    at java.lang.Thread.run(Thread.java:744)
2014-02-01 15:21:48,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60853, bytes: 132096, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000018_0_1657459557_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-6962923875569811947_1279, duration: 276262265702
2014-02-01 15:21:56,707 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60849, bytes: 792576, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000013_0_1311506552_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_4630218397829850426_1316, duration: 289841363522
2014-02-01 15:23:46,614 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getProtocolVersion(org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol, 3) from 192.168.0.19:48460: output error
2014-02-01 15:23:46,617 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1756)
    at org.apache.hadoop.ipc.Server.access$2000(Server.java:97)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:780)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:844)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1472)
2014-02-01 15:24:26,800 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36391, bytes: 10821100, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000084_0_-2100756773_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_496206494030330170_1187, duration: 439385255122
2014-02-01 15:27:11,871 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.20:32913, bytes: 462336, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000004_0_-1095467656_1, offset: 19968, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-7029660283973842017_1326, duration: 205748392367
2014-02-01 15:27:57,144 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36393, bytes: 10865080, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000033_0_-1409402881_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-8749840347184507986_1447, duration: 649481124760
2014-02-01 15:28:47,945 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded blk_887028200097641216_1396
2014-02-01 15:30:17,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.8:58304, bytes: 10743459, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000202_0_1200991434_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_887028200097641216_1396, duration: 69130787562
2014-02-01 15:32:05,208 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-7029660283973842017_1326 to 192.168.0.8:50010
2014-02-01 15:32:55,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-34479901

Here is how my mapred-site.xml files are set up:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
</property>

Each node has 8 CPUs and 8 GB of RAM. I know that I have set mapred.child.java.opts too high, but the same jobs used to run with these settings and the same data. I have set reduce slowstart to 1.0, so reducers start only after all the mappers have finished.
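For reference, here is a sanity check on the memory math, plus a sketch of the slowstart setting mentioned above (using the classic MRv1 property name; treat it as an illustration rather than a copy of my actual file):

<!-- Sketch only: the reduce slowstart setting described above,
     using the classic MRv1 property name. Note the arithmetic:
     4 map slots at -Xmx2048m is already 8 GB of task heap, the
     node's entire RAM, before the DataNode, TaskTracker and OS
     are counted, so the nodes are heavily over-committed. -->
<property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>
</property>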

Pinging some nodes results in a small percentage of packets being lost, and the ssh connection freezes for a while, but I don't know if that is relevant. I have added the following line to the /etc/security/limits.conf file on each node:

hadoop hard nofile 16384

but that didn't work either.

SOLUTION: It turns out it was indeed a memory error. I had too many tasks running and the machines crashed. After they crashed and I rebooted them, the hadoop jobs did not run correctly, even though I set the correct number of mappers. The solution was to remove the bad datanodes (through decommissioning) and then include them again. Here is what I did, and everything worked perfectly again, without losing any data:

How do I correctly remove nodes in Hadoop?

And of course, set the right number of max map and reduce tasks per node.
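For reference, a rough sketch of how decommissioning is typically configured on a classic Hadoop 1.x cluster (the excludes-file path below is just an example, not taken from this setup). In hdfs-site.xml on the master:

<property>
    <name>dfs.hosts.exclude</name>
    <!-- example path: a file listing the datanodes to decommission -->
    <value>/home/hadoop/excludes</value>
</property>

After adding the bad datanodes' hostnames to that file, running hadoop dfsadmin -refreshNodes starts the decommission; once the web UI shows the nodes as decommissioned, emptying the file and refreshing again lets them rejoin the cluster.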

asked Jan 31 '14 by vefthym


2 Answers

You are running out of memory here: as per your mapred config, each child JVM gets 2 GB of heap and 4 map tasks are allowed per node, so the task JVMs alone can claim the node's entire 8 GB of RAM.

Please try running the same job with a 1 GB -Xmx; it should work.

If you want to use your cluster efficiently, set the -Xmx according to the block size of your files.

If your blocks are 128 MB, then 512 MB is sufficient.
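In mapred-site.xml terms, the suggestion above would look roughly like this (a sketch of the answer's numbers, not values verified on this cluster):

<property>
    <name>mapred.child.java.opts</name>
    <!-- roughly one 128 MB block per task plus working overhead -->
    <value>-Xmx512m</value>
</property>

With 4 map and 4 reduce slots per node, that caps task heap at about 4 GB, leaving room for the DataNode and TaskTracker daemons.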

answered by Vikas Hardia


Does the job have a combiner step between map and reduce? I have had a problem with memory-heavy map and combine steps happening at the same time on the same nodes in jobs that take a lot of memory. With your config, if 2 maps and 2 combines are running at once, you could be using 8 GB of RAM if you hold large objects in memory. Just a thought.

answered by Mark Giaconia