
Hadoop nodes die (crash) after a while

I have a hadoop cluster of 16 (ubuntu 12.04 server) nodes (1 master and 15 slaves). They are connected through a private network and the master also has a public IP (it belongs to two networks). When I run small tasks, i.e., with small input and small processing time, everything works. However, when I run bigger tasks, i.e. with 7-8 GB of input data, my slave nodes start dying one after another.

From the web UI (http://master:50070/dfsnodelist.jsp?whatNodes=LIVE) I see that the "last contact" time starts increasing, and from my cluster provider's web UI I see that the nodes have crashed. Here is the screenshot of a node (I cannot scroll up):

[screenshot: crashed node's console output from the provider's web UI]

Another machine showed this error, with hadoop dfs running, while no job was running:

BUG: soft lockup - CPU#7 stuck for 27s! [java:4072]

and

BUG: soft lockup - CPU#5 stuck for 41s! [java:3309]
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
         res 40/00:02:00:08:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata2.00: status: { DRDY }

Here is another screenshot (which I cannot make any sense of):

[screenshot: second node's console output]

Here is the log of a crashed datanode (with IP 192.168.0.9):

2014-02-01 15:17:34,874 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving blk_-2375077065158517857_1818 src: /192.168.0.7:53632 dest: /192.168.0.9:50010
2014-02-01 15:20:14,187 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for blk_-2375077065158517857_1818 java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-2375077065158517857_1818 0 : Thread is interrupted.
2014-02-01 15:20:17,556 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for blk_-2375077065158517857_1818 terminating
2014-02-01 15:20:17,557 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-2375077065158517857_1818 received exception java.io.EOFException: while trying to read 65557 bytes
2014-02-01 15:20:17,560 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:296)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:340)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:582)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:404)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
    at java.lang.Thread.run(Thread.java:744)
2014-02-01 15:21:48,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60853, bytes: 132096, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000018_0_1657459557_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-6962923875569811947_1279, duration: 276262265702
2014-02-01 15:21:56,707 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.19:60849, bytes: 792576, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000013_0_1311506552_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_4630218397829850426_1316, duration: 289841363522
2014-02-01 15:23:46,614 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call getProtocolVersion(org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol, 3) from 192.168.0.19:48460: output error
2014-02-01 15:23:46,617 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:1756)
    at org.apache.hadoop.ipc.Server.access$2000(Server.java:97)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:780)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:844)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1472)
2014-02-01 15:24:26,800 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36391, bytes: 10821100, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000084_0_-2100756773_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_496206494030330170_1187, duration: 439385255122
2014-02-01 15:27:11,871 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.20:32913, bytes: 462336, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000004_0_-1095467656_1, offset: 19968, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-7029660283973842017_1326, duration: 205748392367
2014-02-01 15:27:57,144 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.9:36393, bytes: 10865080, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000033_0_-1409402881_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_-8749840347184507986_1447, duration: 649481124760
2014-02-01 15:28:47,945 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded blk_887028200097641216_1396
2014-02-01 15:30:17,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.9:50010, dest: /192.168.0.8:58304, bytes: 10743459, op: HDFS_READ, cliID: DFSClient_attempt_201402011511_0001_m_000202_0_1200991434_1, offset: 0, srvID: DS-271028747-192.168.0.9-50010-1391093674214, blockid: blk_887028200097641216_1396, duration: 69130787562
2014-02-01 15:32:05,208 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-7029660283973842017_1326 to 192.168.0.8:50010
2014-02-01 15:32:55,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.0.9:50010, storageID=DS-271028747-192.168.0.9-50010-1391093674214, infoPort=50075, ipcPort=50020) Starting thread to transfer blk_-34479901

Here is how my mapred-site.xml files are set up:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>

<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
</property>

Each node has 8 CPUs and 8 GB of RAM. I know that I have set mapred.child.java.opts too high, but the same jobs used to run with these settings and the same data. I have set reduce slowstart to 1.0, so reducers start only after all the mappers have finished.
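For reference, here is a sanity check on the memory math, plus a sketch of the slowstart setting mentioned above (using the classic MRv1 property name; treat it as an illustration rather than a copy of my actual file):

<!-- Sketch only: the reduce slowstart setting described above,
     using the classic MRv1 property name. Note the arithmetic:
     4 map slots at -Xmx2048m is already 8 GB of task heap, the
     node's entire RAM, before the DataNode, TaskTracker and OS
     are counted, so the nodes are heavily over-committed. -->
<property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>
</property>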

Pinging some nodes results in a small percentage of packets being lost, and the ssh connection freezes for a while, but I don't know if that is relevant. I have added the following line to the /etc/security/limits.conf file on each node:

hadoop hard nofile 16384

but that didn't work either.

SOLUTION: It turns out it was indeed a memory error. I had too many tasks running and the machines crashed. After they crashed and I rebooted them, the hadoop jobs did not run correctly, even though I set the correct number of mappers. The solution was to remove the bad datanodes (through decommissioning) and then include them again. Here is what I did, and everything worked perfectly again, without losing any data:

How do I correctly remove nodes in Hadoop?

And of course, set the right number of max map and reduce tasks per node.
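For reference, a rough sketch of how decommissioning is typically configured on a classic Hadoop 1.x cluster (the excludes-file path below is just an example, not taken from this setup). In hdfs-site.xml on the master:

<property>
    <name>dfs.hosts.exclude</name>
    <!-- example path: a file listing the datanodes to decommission -->
    <value>/home/hadoop/excludes</value>
</property>

After adding the bad datanodes' hostnames to that file, running hadoop dfsadmin -refreshNodes starts the decommission; once the web UI shows the nodes as decommissioned, emptying the file and refreshing again lets them rejoin the cluster.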

asked Jan 31 '14 by vefthym


2 Answers

You are running out of memory here: as per your mapred config, each child JVM gets 2 GB of heap and 4 map tasks are allowed per node, so the task JVMs alone can claim the node's entire 8 GB of RAM.

Please try running the same job with a 1 GB -Xmx; it should work.

If you want to use your cluster efficiently, set the -Xmx according to the block size of your files.

If your blocks are 128 MB, then 512 MB is sufficient.
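In mapred-site.xml terms, the suggestion above would look roughly like this (a sketch of the answer's numbers, not values verified on this cluster):

<property>
    <name>mapred.child.java.opts</name>
    <!-- roughly one 128 MB block per task plus working overhead -->
    <value>-Xmx512m</value>
</property>

With 4 map and 4 reduce slots per node, that caps task heap at about 4 GB, leaving room for the DataNode and TaskTracker daemons.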

answered by Vikas Hardia


Does the job have a combiner step between map and reduce? I have had a problem with memory-heavy map and combine steps happening at the same time on the same nodes in jobs that take a lot of memory. With your config, if 2 maps and 2 combines are running at once, you could be using 8 GB of RAM if you hold large objects in memory. Just a thought.

answered by Mark Giaconia