Hadoop streaming job using Mxnet failing in AWS Emr

I have set up an EMR step in AWS Data Pipeline. The step command looks like this:

/usr/lib/hadoop-mapreduce/hadoop-streaming.jar,-input,s3n://input-bucket/input-file,-output,s3://output/output-dir,-mapper,/bin/cat,-reducer,reducer.py,-file,/scripts/reducer.py,-file,/params/parameters.bin
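
For readability, that comma-separated step corresponds to roughly this hadoop-streaming invocation (same jar, paths, and options, just expanded; shown only to make the arguments easier to read):

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input s3n://input-bucket/input-file \
        -output s3://output/output-dir \
        -mapper /bin/cat \
        -reducer reducer.py \
        -file /scripts/reducer.py \
        -file /params/parameters.bin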

I am getting the following error:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:393)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

I have tried running the reducer step separately on my desktop (on a single-node Hadoop setup) and it works. I have already included #!/usr/bin/env python in the reducer script. I suspect that I am not writing the EMR step correctly.
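
For reference, a minimal sketch of how the reducer is laid out (simplified illustration only; the actual prediction logic is omitted and the pass-through below is a placeholder):

    #!/usr/bin/env python
    # Simplified streaming-reducer skeleton; the real reducer runs mxnet
    # predictions instead of the pass-through below.
    import sys

    def main():
        for line in sys.stdin:
            key, _, value = line.rstrip('\n').partition('\t')
            # real code: run the mxnet prediction for this record here
            sys.stdout.write('%s\t%s\n' % (key, value))

    if __name__ == '__main__':
        main()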

EMR version: 5.5.0

EDIT: After further investigation, I have found the exact line of code where the reducer fails on EMR. I am doing machine learning predictions using the mxnet library in the reducer, and the reducer fails when I load the model parameters. Reference to the API doc is here

module.load_params('parameters.bin')

I have checked the current working directory on the EMR node (using os.listdir(os.getcwd())) and it contains the parameters.bin file (I have even printed the file contents successfully). I want to point out again that the streaming job works fine on my single-node local setup.
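
For context, a hypothetical sketch of the loading step (the symbol file name, data name, and shape below are placeholders for my real network, not the exact code):

    import os
    import mxnet as mx

    # Verify that the distributed cache shipped parameters.bin alongside the
    # reducer script (this is the check described above).
    assert 'parameters.bin' in os.listdir(os.getcwd())

    # Placeholder network definition; 'model-symbol.json', 'data' and the
    # (1, 119) shape are illustrative assumptions only.
    net = mx.sym.load('model-symbol.json')
    module = mx.mod.Module(symbol=net, data_names=['data'], label_names=None)
    module.bind(data_shapes=[('data', (1, 119))], for_training=False)
    module.load_params('parameters.bin')   # <-- the line that fails on EMR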

EDIT2: I set the number of reducer tasks to 2. I enclosed my reducer code in a try-except block, and I see the following error in one of the tasks (the other one runs fine):

[10:27:25] src/ndarray/ndarray.cc:299: Check failed: from.shape() == to->shape() operands shape mismatchfrom.shape = (119,) to.shape=(111,) 

Stack trace returned 10 entries:    
[bt] (0) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc72fc) [0x7f81443842fc]    
[bt] (1) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc166f4) [0x7f8144ed36f4]   
[bt] (2) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0xc74c24) [0x7f8144f31c24]   
[bt] (3) /usr/local/lib/python2.7/site-packages/mxnet/libmxnet.so(MXImperativeInvoke+0x2cd) [0x7f8144db935d]    
[bt] (4) /usr/lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7f8150b8acec]  
[bt] (5) /usr/lib64/libffi.so.6(ffi_call+0x1f5) [0x7f8150b8a615]    
[bt] (6) /usr/lib64/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x30b) [0x7f8150d9d97b]   
[bt] (7) /usr/lib64/python2.7/lib-dynload/_ctypes.so(+0xa915) [0x7f8150d97915]  
[bt] (8) /usr/lib64/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f815a69e183]    
[bt] (9) /usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x337d) [0x7f815a73107d] 
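
The try/except wrapper I used is roughly this (simplified; run_reducer below is only a placeholder for the actual reducer logic):

    import sys
    import traceback

    def run_reducer():
        # placeholder: the actual mxnet prediction logic goes here
        for line in sys.stdin:
            sys.stdout.write(line)

    try:
        run_reducer()
    except Exception:
        # Write the full traceback to stderr so it appears in the YARN task
        # logs rather than only as a non-zero exit code.
        traceback.print_exc(file=sys.stderr)
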
asked Nov 08 '22 by ishan3243
1 Answer

I figured out the issue. The shapes expected by mxnet depended on the data set (specifically, on the maximum value in the data set). Training happens on a single GPU box that holds the whole data set. Prediction works fine in the single-node setup because that node also sees all the data used in training. But when a multi-node cluster is used, the data set gets split, so the maximum value differs on each node and the shapes no longer match. This was causing the error.

I have now made the expected shapes independent of the data set, and the error no longer occurs. I hope this clarifies things.
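
To illustrate the kind of change (a simplified sketch; FEATURE_DIM and the one-hot style encoding are assumptions for illustration, the real code differs):

    import numpy as np

    # Fixed dimension chosen once at training time; every node uses the same
    # value regardless of which slice of the data it receives.
    FEATURE_DIM = 119   # illustrative value only

    def encode(indices):
        # Before the fix the vector length was derived from the maximum value
        # seen in the local data, which differed per node after the split.
        vec = np.zeros(FEATURE_DIM, dtype='float32')
        for i in indices:
            if 0 <= int(i) < FEATURE_DIM:
                vec[int(i)] = 1.0
        return vec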

answered Dec 11 '22 by ishan3243