 

How to tune mapred.reduce.parallel.copies?

Tags:

hadoop

After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html we want to experiment with mapred.reduce.parallel.copies.

The blog mentions "looking very carefully at the logs". How would we know we've reached the sweet spot? What should we look for? How can we detect that we're over-parallelizing?

ihadanny asked Dec 27 '11




2 Answers

Reaching the "sweet spot" is really just finding the parameters that give you the best result for whichever metric you consider most important, usually overall job time. To figure out which parameters are working, I would suggest using the profiling tools that ship with Hadoop: MrBench, TestDFSIO, and NNBench. These are found in hadoop-mapreduce-client-jobclient-*-tests.jar.

Running the jar with no arguments prints a long list of benchmark programs you can use besides the ones mentioned above:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
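
For example, a minimal sketch of invoking two of those benchmarks (the jar path, file counts, and sizes are placeholders that depend on your installation and Hadoop version):

    # Write, then read, benchmark files to measure raw HDFS throughput.
    # Hadoop 2.x accepts "-fileSize 128MB"; older releases take a bare
    # number of megabytes instead.
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        TestDFSIO -write -nrFiles 10 -fileSize 128MB
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        TestDFSIO -read -nrFiles 10 -fileSize 128MB

    # Time a small MapReduce job repeatedly to measure framework overhead.
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
        mrbench -numRuns 5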

I would suggest starting with the default parameters to establish baseline benchmarks, then changing one parameter at a time and rerunning. It is a bit time consuming, but worth it, especially if you use a script to change the parameter and run the benchmarks, as sketched below.
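
Here is one possible sketch of such a sweep. It assumes the stock wordcount example jar and an input set already uploaded to /bench/in (both hypothetical placeholders); note that on Hadoop 2.x the property has been renamed mapreduce.reduce.shuffle.parallelcopies, though the old name still works via the deprecation mapping:

    # Sweep one parameter and time a representative job at each value.
    for copies in 5 10 20 40; do
        start=$(date +%s)
        hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
            -Dmapred.reduce.parallel.copies=$copies /bench/in /bench/out_$copies
        echo "parallel.copies=$copies took $(( $(date +%s) - start ))s"
    done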

greedybuddha answered Nov 28 '22


In order to do that you should basically watch four things: CPU, RAM, disk, and network. If your setup is crossing the threshold on any of these metrics, you can deduce that you are pushing the limits. For example, if you set "mapred.reduce.parallel.copies" to a value much higher than the number of available cores, you'll end up with too many threads sitting in a waiting state, since this property controls how many threads are created to fetch map output; on top of that, the network might get overwhelmed. Or, if there is too much intermediate output to be shuffled, your job will become slow, since it will need a disk-based shuffle, which is slower than a RAM-based shuffle. Choose a sensible value for "mapred.job.shuffle.input.buffer.percent" based on your RAM (it defaults to 70% of the reducer heap, which is normally good). These are the kinds of signs that tell you whether you are over-parallelizing or not. There are a lot of other things to consider as well; I would recommend going through Chapter 6 of "Hadoop: The Definitive Guide".
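
As an illustrative sketch, both properties mentioned above can be overridden per job from the command line on any job that uses the standard Tool/GenericOptionsParser plumbing (wordcount and the input/output paths are placeholders; the values are starting points to tune, not recommendations):

    # Per-job overrides for the two shuffle knobs discussed above.
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
        -Dmapred.reduce.parallel.copies=10 \
        -Dmapred.job.shuffle.input.buffer.percent=0.70 \
        /bench/in /bench/out_shuffle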

Some of the measures you could take to make your jobs more efficient include using a combiner to limit the data transfer and enabling intermediate compression; a sketch of the latter follows.
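
For instance, map-output compression can be switched on per job, shown here with the old property names to match the rest of this answer (Snappy assumes the native codec libraries are installed on the cluster), while a combiner is set in the job code itself via job.setCombinerClass(...):

    # Compress intermediate map output to shrink shuffle traffic.
    hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
        -Dmapred.compress.map.output=true \
        -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
        /bench/in /bench/out_compressed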

HTH

P.S.: This answer is not specific to just "mapred.reduce.parallel.copies"; it is about tuning your job in general. Setting only this property is not going to help you much on its own. You should consider the other important properties as well.

Tariq answered Nov 28 '22