I'm running a Hadoop job over 1,5 TB of data with doing much pattern matching. I have several machines with 16GB RAM each, and I always get OutOfMemoryException
on this job with this data (I'm using Hive).
I would like to know how to optimally set option HADOOP_HEAPSIZE
in file hadoop-env.sh
so, my job would not fail. Is it even possible, to set this option so my jobs won't fail?
When I set HADOOP_HEAPSIZE
to 1,5 GB and removed half of pattern matching from query, job run successfully. So what is this option for, if it doesn't help avoiding job failures?
I ment to do more experimenting with optimal setup, but since those jobs take >10hr to run, I'm asking for your advice.
If your process attempts to use more than the maximum value, Hive kills the process and throws the OutOfMemoryError exception. To resolve this issue, increase the -Xmx value in the Hive shell script (in MB), and then run your Hive query again.
The size of the heap can vary, so many users restrict the Java heap size to 2-8 GB in order to minimize garbage collection pauses.
HADOOP_HEAPSIZE sets the JVM heap size for all Hadoop project servers such as HDFS, YARN, and MapReduce. HADOOP_HEAPSIZE is an integer passed to the JVM as the maximum memory (Xmx) argument. For example: HADOOP_HEAPSIZE=1024.
Is the Job failing or is your server crashing? If your Job is failing because of OutOfMemmory on nodes you can tweek your number of max maps and reducers and the JVM opts for each so that will never happen. mapred.child.java.opts (the default is 200Xmx) usually has to be increased based on your data nodes specific hardware.
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
Max tasks can be setup on the Namenode or overridden (and set final) on data nodes that may have different hardware configurations. The max tasks are setup for both mappers and reducers. To calculate this it is based on CPU (cores) and the amount of RAM you have and also the JVM max you setup in mapred.child.java.opts (the default is 200). The Datanode and Tasktracker each are set to 1GB so for a 8GB machine the mapred.tasktracker.map.tasks.maximum could be set to 7 and the mapred.tasktracker.reduce.tasks.maximum set to 7 with the mapred.child.java.opts set to -400Xmx (assuming 8 cores). Please note these task maxes are as much done by your CPU if you only have 1 CPU with 1 core then it is time to get new hardware for your data node or set the mask tasks to 1. If you have 1 CPU with 4 cores then setting map to 3 and reduce to 3 would be good (saving 1 core for the daemon).
By default there is only one reducer and you need to configure mapred.reduce.tasks to be more than one. This value should be somewhere between .95 and 1.75 times the number of maximum tasks per node times the number of data nodes. So if you have 3 data nodes and it is setup max tasks of 7 then configure this between 25 and 36.
If your server is crashing with OutOfMemory issues then that is where the HADOOP_HEAPSIZE comes in just for the processes heap (not the execution of task).
Lastly, if your Job is taking that long you can check to see if you have another good configuration addition is mapred.compress.map.output. Setting this value to true should (balance between the time to compress vs transfer) speed up the reducers copy greatly especially when working with large data sets. Often jobs do take time but there are also options to tweak to help speed things up =8^)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With