I've been getting the following error in several cases:
2017-03-23 11:55:10,794 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1490079327128_0048_r_000003_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
I noticed it happens on large sorts, but when I change the "Sort Allocation Memory" it does not help.
I tried changing other memory properties as well, but the solution still eludes me. Is there a good explanation of how MapReduce works and how the different components interact? What should I change? Where do I locate the Java error leading to this?
You will see exit code 143 in your logs when the container terminates gracefully in response to SIGTERM; there are many cases in which Kubernetes needs to shut down a pod.
Exit Code 143 means that the container received a SIGTERM signal from the operating system, which asks the container to gracefully terminate, and the container succeeded in gracefully terminating (otherwise you will see Exit Code 137).
Exit code 143 from a JVM process means the JVM was terminated by SIGTERM: exit codes above 128 encode 128 plus the signal number, so 128 + 15 (SIGTERM) = 143.
Exit code 137 is triggered when a pod or container in your Kubernetes environment exceeds the amount of memory it has been assigned. This exit code is typically accompanied by, or simply known as, OOMKilled; "Killed" refers to the result of the error, which is that the pod is terminated.
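To make the arithmetic concrete, here is a minimal sketch (plain Java, purely illustrative, not part of Hadoop or Kubernetes) showing where both values come from: a process killed by a signal exits with 128 plus the signal number.

// Minimal sketch: exit codes above 128 encode "128 + signal number".
public class ExitCodes {
    public static void main(String[] args) {
        int sigterm = 15; // graceful termination request
        int sigkill = 9;  // forced kill, e.g. by the OOM killer
        System.out.println("SIGTERM -> exit code " + (128 + sigterm)); // 143
        System.out.println("SIGKILL -> exit code " + (128 + sigkill)); // 137
    }
}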
Exit code 143 is usually related to memory/GC issues. Your default mapper/reducer memory settings may not be sufficient to run a large data set, so try setting higher ApplicationMaster (AM), map, and reduce memory when a large YARN job is invoked; see the sketch below.
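For example, a minimal sketch of a MapReduce driver that raises those settings (these are standard Hadoop property names; the values are only illustrative and should be tuned to your cluster and data size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container sizes requested from YARN, in MB.
        conf.set("yarn.app.mapreduce.am.resource.mb", "2048");
        conf.set("mapreduce.map.memory.mb", "4096");
        conf.set("mapreduce.reduce.memory.mb", "8192");
        // JVM heap should stay below the container size (roughly 80% is common).
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");

        Job job = Job.getInstance(conf, "memory-tuned job");
        // ... set mapper/reducer classes, input/output paths, then submit as usual.
    }
}

The same properties can also be passed on the command line with -D (for example -Dmapreduce.reduce.memory.mb=8192) if the job is run through ToolRunner.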
Please check this link out: https://community.hortonworks.com/questions/96183/help-troubleshoot-container-killed-by-the-applicat.html
Please also look into: https://www.slideshare.net/SparkSummit/top-5-mistakes-when-writing-spark-applications-63071421
It is an excellent source for optimizing your code.
I found out I had mixed up two separate things. The 143 exit code is from the metrics collector, which is down. The jobs are killed, as far as I understand, due to memory issues: the problem is large window functions that can't reduce the data until the last reducer, which ends up holding all the data.
However, the place in the logs where the reason the job was killed is reported still eludes me.