Hadoop streaming job failure: Task process exit with nonzero status of 137

Question

I've been banging my head on this one for a few days, and hope that somebody out there will have some insight.

I've written a streaming map reduce job in perl that is prone to having one or two reduce tasks take an extremely long time to execute. This is due to a natural asymmetry in the data: some of the reduce keys have over a million rows, where most have only a few dozen.

I've had problems with long tasks before, and I've been incrementing counters throughout to ensure that map reduce doesn't time them out. But now they are failing with an error message I hadn't seen before:

java.io.IOException: Task process exit with nonzero status of 137.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

This is not the standard timeout error message, but the error code 137 = 128+9 suggests that my reducer script received a kill -9 from Hadoop. The tasktracker log contains the following:

2011-09-05 19:18:31,269 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201109051336_0003_m_000029_1,7) failed :
org.mortbay.jetty.EofException
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
        at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
        at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
        at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
        at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2940)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
        at sun.nio.ch.IOUtil.write(IOUtil.java:43)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
        at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
        at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
        ... 24 more

2011-09-05 19:18:31,289 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.92.8.202:50060, dest: 10.92.8.201:46436, bytes: 7340032, op: MAPRED_SHUFFLE, cliID: attempt_201109051336_0003_m_000029_1
2011-09-05 19:18:31,292 ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
        at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
        at org.mortbay.jetty.Response.sendError(Response.java:240)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2963)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

I've been looking around the forums, and it sounds like the Warnings are commonly found in jobs that run without error, and can usually be ignored. The error makes it look like the reducer lost contact with the map output, but I can't figure out why. Does anyone have any ideas?

Here's a potentially relevant fact: These long tasks were making my job take days where it should take minutes. Since I can live without the output from one or two keys, I tried to implement a ten minute timeout in my reducer as follows:

eval {  
        local $SIG{ALRM} = sub {
            print STDERR "Processing of new merchant names in $prev_merchant_zip timed out...
";
            print STDERR "reporter:counter:tags,failed_zips,1
";
            die "timeout";
        };

        alarm 600;

        #Code that could take a long time to execute

        alarm 0;
    };

This timeout code works like a charm when I test the script locally, but the strange mapreduce errors started after I introduced it. However, the failures seem to occur well after the first timeout, so I'm not sure if it's related.

Thanks in advance for any help!

cftarnas · Accepted Answer

Two possibilities come to mind:

RAM usage, if a tasks uses too much RAM the OS can kill it (after horrible swapping, etc).
Are you using any non-reentrant libraries? Maybe the timer is being triggered at an inopportune point in a library call.

Tombart · Answer

Exit code 137 is a typical sign of the infamous OOM killer. You can easily check it using dmesg command for messages like this:

[2094250.428153] CPU: 23 PID: 28108 Comm: node Tainted: G         C O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt20-1+deb8u2
[2094250.428155] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[2094250.428156]  ffff880773439400 ffffffff8150dacf ffff881328ea32f0 ffffffff8150b6e7
[2094250.428159]  ffff881328ea3808 0000000100000000 ffff88202cb30080 ffff881328ea32f0
[2094250.428162]  ffff88107fdf2f00 ffff88202cb30080 ffff88202cb30080 ffff881328ea32f0
[2094250.428164] Call Trace:
[2094250.428174]  [<ffffffff8150dacf>] ? dump_stack+0x41/0x51
[2094250.428177]  [<ffffffff8150b6e7>] ? dump_header+0x76/0x1e8
[2094250.428183]  [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428186]  [<ffffffff8114088d>] ? oom_kill_process+0x21d/0x370
[2094250.428188]  [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428193]  [<ffffffff811a053a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
[2094250.428195]  [<ffffffff8119fac0>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
[2094250.428199]  [<ffffffff81141040>] ? pagefault_out_of_memory+0x10/0x80
[2094250.428203]  [<ffffffff81057505>] ? __do_page_fault+0x3c5/0x4f0
[2094250.428208]  [<ffffffff8109d017>] ? put_prev_entity+0x57/0x350
[2094250.428211]  [<ffffffff8109be86>] ? set_next_entity+0x56/0x70
[2094250.428214]  [<ffffffff810a2c61>] ? pick_next_task_fair+0x6e1/0x820
[2094250.428219]  [<ffffffff810115dc>] ? __switch_to+0x15c/0x570
[2094250.428222]  [<ffffffff81515ce8>] ? page_fault+0x28/0x30

You can easily if OOM is enabled:

$ cat /proc/sys/vm/overcommit_memory
0

Basically OOM killer tries to kill process that eats largest part of memory. And you probably don't want to disable it:

The OOM killer can be completely disabled with the following command. This is not recommended for production environments, because if an out-of-memory condition does present itself, there could be unexpected behavior depending on the available system resources and configuration. This unexpected behavior could be anything from a kernel panic to a hang depending on the resources available to the kernel at the time of the OOM condition.

sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf

Same situation can happen if you use e.g. cgroups for limiting memory. When process exceeds given limit it gets killed without warning.

Hadoop streaming job failure: Task process exit with nonzero status of 137

Tags:

hadoop

streaming

Alexander Hasha

2 Answers

cftarnas

Tombart

Recent Activity

Donate For Us

Hadoop streaming job failure: Task process exit with nonzero status of 137

Tags:

hadoop

streaming

Alexander Hasha

2 Answers

cftarnas

Tombart

Related questions

Recent Activity

Donate For Us