
When do reduce tasks start in Hadoop?

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?

asked Jul 26 '12 by Slayer


People also ask

When are the reducers started in a MapReduce job?

Reduce starts only after all the mappers have finished their tasks; the reducer has to communicate with all the mappers, so it has to wait until the last mapper finishes. However, each mapper starts transferring its data the moment it has completed its own task.

What determines the number of reduce tasks?

The number of reducers depends on the configuration of the cluster, although you can limit the number of reducers used by your MapReduce job. A single reducer would indeed become a bottleneck in your MapReduce job if you are dealing with any significant amount of data.
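For example, a minimal sketch of setting the reducer count explicitly on a job (the class name and the count of 8 are hypothetical; input/output paths and mapper/reducer classes are omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "reducer-count-example");
            // Request 8 reduce tasks instead of relying on the cluster default.
            job.setNumReduceTasks(8);
        }
    }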

What is the first stage of MapReduce?

A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage. Map stage − the map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS).
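As a rough illustration of the map and reduce stages, here is a minimal word-count sketch (class names are hypothetical); the shuffle stage between the two is handled by the framework:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        // Map stage: emit (word, 1) for every token in the input split.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce stage: sum the counts that the shuffle grouped under each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }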

How does reduce work in MapReduce?

It reduces the data on each mapper to a simplified form before passing it downstream. This makes shuffling and sorting easier because there is less data to work with. Often, the combiner class is set to the reducer class itself; this is only valid when the reduce function is commutative and associative.
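For illustration, a sketch of wiring the reducer in as the combiner during job setup, reusing the hypothetical classes from the word-count sketch above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word-count");
            job.setMapperClass(WordCountSketch.TokenMapper.class);
            // Reuse the reducer as the combiner: safe only because summing
            // counts is commutative and associative.
            job.setCombinerClass(WordCountSketch.SumReducer.class);
            job.setReducerClass(WordCountSketch.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
        }
    }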


2 Answers

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means it's doing the shuffle, 34-66% is sort, 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for the mappers to finish.

Reducers start shuffling based on a threshold of percentage of mappers that have finished. You can change the parameter to get reducers to start sooner or later.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later, one that would actually use those reduce slots, now can't use them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.
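For example, a minimal sketch of setting the slowstart threshold to 0.9 programmatically (both the old and new property names are set here; the rest of the job setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Newer property name (Hadoop 2.x+): start reducers once 90% of maps are done.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.9f);
            // Older, deprecated name, still honored on many clusters.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.9f);
            Job job = Job.getInstance(conf, "slowstart-example");
        }
    }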

answered Oct 13 '22 by Donald Miner


The reduce phase can start long before a reducer is called. As soon as a mapper finishes its job, the generated data undergoes some sorting and shuffling (which includes calls to the combiner and the partitioner). The reducer "phase" kicks in the moment this post-mapper data processing starts. As this processing is done, you will see progress in the reducer percentage. However, none of the reducers have actually been called yet. Depending on the number of processors available/used, the nature of the data, and the number of expected reducers, you may want to change the parameter as described by @Donald-miner above.
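To make the partitioner's role concrete, here is a hypothetical sketch of a custom partitioner; it runs on the map side, as each mapper's output is written, to decide which reducer a (key, value) pair will eventually go to:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0) {
                return 0; // no reducers, nothing to partition
            }
            // Route keys by their first character so related keys land on the same reducer.
            String k = key.toString();
            char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
            return first % numReduceTasks;
        }
    }

    // Registered on the job with: job.setPartitionerClass(FirstLetterPartitioner.class);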

answered Oct 13 '22 by javadevg