Default number of reducers

Tags: hadoop, mapreduce, hdfs

In Hadoop, if we have not set the number of reducers, how many reducers will be created?

The number of mappers depends on (total data size)/(input split size). E.g. if the data size is 1 TB and the input split size is 100 MB, the number of mappers will be (1000*1000)/100 = 10000 (ten thousand).

Which factors does the number of reducers depend on? How many reducers are created for a job?


asked Jan 10 '16 07:01

Mohit Jain

2 Answers

How Many Reduces? (From the official documentation)

The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
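To make the rule of thumb concrete, here is a minimal Java sketch of the arithmetic. The class name, the method name, and the cluster size (10 nodes with 8 containers each) are assumptions made up for illustration; they are not part of Hadoop or of the quoted documentation.

```java
// Sketch of the 0.95 / 1.75 rule of thumb quoted above.
// ReducerEstimate and estimate() are hypothetical names, not Hadoop APIs.
final class ReducerEstimate {

    /** Returns floor(factor * nodes * containersPerNode), but never less than 1. */
    static int estimate(int nodes, int containersPerNode, double factor) {
        return Math.max(1, (int) Math.floor(factor * nodes * containersPerNode));
    }

    public static void main(String[] args) {
        int nodes = 10;            // assumed cluster size
        int containersPerNode = 8; // assumed maximum containers per node

        // 0.95: all reduces fit in one wave, with a little headroom left over.
        System.out.println("single wave: " + estimate(nodes, containersPerNode, 0.95));
        // 1.75: faster nodes pick up a second wave of reduces.
        System.out.println("two waves:   " + estimate(nodes, containersPerNode, 1.75));
    }
}
```

For this assumed cluster the sketch gives 76 reducers with the 0.95 factor and 140 with the 1.75 factor. The result is only a starting point that you would then pass to the job yourself; Hadoop does not compute it for you.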

The same article also covers the mapper count.

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
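The figure above can be reproduced with a short calculation. This is a rough sketch only: it treats the input as one large file cut into equal splits, whereas in practice splits are computed per file, so many small files produce more maps than this estimate. The class and method names are hypothetical.

```java
// Rough estimate of the number of map tasks from input size and split size.
// MapCountEstimate and estimateMaps() are hypothetical names, not Hadoop APIs.
final class MapCountEstimate {

    /** Ceiling division: how many splits of splitBytes are needed to cover inputBytes. */
    static long estimateMaps(long inputBytes, long splitBytes) {
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tb = mb * 1024L * 1024L;

        // 10 TB of input with a 128 MB block size: 81920, i.e. the "82,000 maps" quoted above.
        System.out.println(estimateMaps(10 * tb, 128 * mb));

        // The example from the question, in decimal units: 1 TB / 100 MB = 10000.
        System.out.println(estimateMaps(1_000_000L * 1_000_000L, 100L * 1_000_000L));
    }
}
```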

If you want to change the number of reducers from its default value of 1, you can set the property below (Hadoop 2.x and later) as a command-line parameter:

mapreduce.job.reduces

or you can set it programmatically with

job.setNumReduceTasks(integer_number);

Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?

answered Oct 10 '22 07:10

Ravindra babu

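By default the number of reducers is set to 1.

You can change it by setting the parameter mapred.reduce.tasks (the older name of the property; from Hadoop 2.x onward it is mapreduce.job.reduces) on the command line, in the Driver code, or in the conf file that you pass.

E.g., as a command-line argument:

bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>

or, in the Driver code:

conf.setNumReduceTasks(int num);

Note that the -D option is only picked up if the driver parses the generic options, for example by running through ToolRunner.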

Recommended read: https://wiki.apache.org/hadoop/HowManyMapsAndReduces

answered Oct 10 '22 07:10

Koustav Ray
