How to optimize shuffling/sorting phase in a hadoop job

Tags:

I'm doing some data preparation using a single node hadoop job. The mapper/combiner in my job outputs many keys (more than 5M or 6M) and obviously the job proceeds slowly or even fails. The mapping phase runs up to 120 mapper and there is just one reducer (these are automatically determined and I've not set any values for them). I want to optimize the job so that shuffling/sorting phase occurs more efficiently. I increased mapreduce.task.io.sort.mb to 300m but the job faile because its value was larger than mapper heap. I then set mapred.child.java.opts to -Xmx1024m but it again failed because it couldn't initialize an Output Collector. What are the best practices for these scenarios?

372

asked Dec 09 '15 18:12

HHH

1 Answers

To begin with, since you are using a single node cluster, there is not much optimization you can do. You will have limited number of containers/slots on a single node cluster and for the amount of data you are processing (5 to 6 million keys), your jobs will always run slow and may also fail.

I am going to answer this question for a fully distributed Hadoop setup. There is a section ("Shuffle and Sort") in the book "Hadoop The Definitive Guide", which you should read for tuning the Shuffle and Sort phase. My answer is mainly influenced by the contents of this section and also my own experience with tuning the MapReduce jobs.

You can do the following to achieve the Shuffle and Sort efficiency:

Combiner: Using combiner will reduce the amount of data transferred to each of the the reducers, since combiner merges the output on the mapper side.
Number of reducers: Choose optimal number of reducers. If data size is huge, then one reducer is not a good idea. Also, setting the number of reducers to a high number, is not a good idea, since the number of reducers also determines the number of partitions on the mapper side. Look at the link here: https://github.com/paulhoule/infovore/wiki/Choosing-the-number-of-reducers
When to start the reducers:; You can control, when the reduce tasks are started. This is determined by configuration mapreduce.job.reduce.slowstart.completedmaps in YARN. It will not start the reducers until a certain percentage of mappers are completed. It is by default set to "0.05" (It means reducers start after 5% of mappers are completed). If the reducers are started early, then most of the reducers are idle, till all the mappers are completed. Also, the reducers may consume the slots, which could otherwise be used by the mappers for processing. By controlling this, you can use the mapper/reducers slots optimally and improve the time spent during the shuffle.
Compress Mapper Output: Its recommended to compress the mapper outputs (determined by configuration: mapreduce.map.output.compress), so that lesser data gets written to disk and gets transferred to reducers.
Tune config "mapreduce.task.io.sort.mb": Increase the buffer size used by the mappers during the sorting. This will reduce the number of spills to the disk.
Tune config "mapreduce.reduce.input.buffer.percent": If your reduce task has lesser memory requirements, then this value can be set to a high percentage. This means, higher amount of heap is used for retaining the map outputs during the reduce phase (after the shuffle phase), thus reducing the number of spills to disk.
Tune config "mapreduce.reduce.shuffle.parallelcopies": Number of threads used to copy map outputs to reducers. Check the link here: how to tune mapred.reduce.parallel.copies?

Following are the other configuration parameters which can be tuned to improve the Shuffle and Sort phase performance (see the description of these configurations here: https://hadoop.apache.org/docs/r2.4.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml):

mapreduce.map.sort.spill.percent: Determines the threshold for the in memory buffer used by the mapper. When this threshold is reached, the the contents of the buffer are spilled to disk. So this value determines the number of spills to disk
mapreduce.task.io.sort.factor: Minimum number of streams to be merged at once, during sorting. So, on the reducer side, if there are 50 mapper outputs and this value is set to 10, then there will be 5 rounds of merging (on an average 10 files for merge round).
mapreduce.shuffle.max.threads: Number of worker threads for copying the map outputs to reducers.
mapreduce.reduce.shuffle.input.buffer.percent: How much of heap should be used for storing the map output, during the shuffle phase in the reducer. This setting determines the amount of mapper output that can be held in memory, before it is spilled to disk.
mapreduce.reduce.shuffle.merge.percent: Threshold for starting the process of merge and spilling to disk
mapreduce.reduce.merge.inmem.threshold: Number of map outputs needed for starting the merge process. When either mapreduce.reduce.shuffle.merge.percent or mapreduce.reduce.merge.inmem.threshold is reached, then the map outputs are merged and spilled to disk.

answered Sep 21 '22 17:09

Manjunath Ballur

Related questions
                            
                                How Java Hadoop Mapper can send multiple values
                            
                                HDFS error put: `input': No such file or directory
                            
                                Apache Hadoop vs Google Bigdata
                            
                                Hadoop Reducer Values in Memory?
                            
                                Loading csv data into Hbase [closed]
                            
                                Difference between 3 memory parameters in Hadoop 2?
                            
                                Create HIVE Table with multi character delimiter
                            
                                How to increase the number of containers in nodemanager in YARN
                            
                                Hadoop: Number of mappers and reducers
                            
                                Hadoop2.2.0 can't visit the web http://<ip>:8088
                            
                                Fine grained transformation vs coarse grained transformations
                            
                                How do I get last modified date from a Hadoop Sequence File?
                            
                                Initialization failed for Block pool <registering> (Datanode Uuid unassigned)
                            
                                How can I use proto3 with Hadoop/Spark?
                            
                                Reading in csv file as dataframe from hdfs
                            
                                number of reducers for 1 task in MapReduce
                            
                                How to run Hbase Java example?
                            
                                HDFS Reduced Replication Factor
                            
                                Which files are ignored as input by mapper?
                            
                                Difference between fs.defaultFS and fs.default.name

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to optimize shuffling/sorting phase in a hadoop job

Tags:

hadoop

hadoop2

mapreduce

HHH

People also ask

1 Answers

Manjunath Ballur

Recent Activity

Donate For Us