I am currently running a job where I fixed the number of map tasks to 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong? I am using this command:
hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0
Output:
11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient:   Job Counters
11/07/30 19:48:56 INFO mapred.JobClient:     Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient:     Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:     Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient:     Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient:   FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:   Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient:     Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient:     Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient:     Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient:     Reduce input records=40000000
[hcrc1425n30]s0907855:
The number of map tasks for a given job is driven by the number of input splits. For each input split (typically one HDFS block) a map task is created, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits.
Every job consists of two key components: map tasks and reduce tasks. The map tasks split the work into parts and produce intermediate data; the reduce tasks shuffle and reduce that intermediate data into a smaller result. The JobTracker acts as the master.
The mapper processes the data and creates several small chunks of intermediate data. The reduce stage is really the combination of the shuffle stage and the reduce stage proper: the reducer's job is to process the data that comes from the mapper and, after processing, produce a new set of output, which is stored in HDFS.
The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits. mapred.map.tasks is just a hint to the InputFormat about the number of maps.
In your example, Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. What you can control is how many map tasks are executed in parallel by each task tracker.
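For illustration only: your counters show HDFS_BYTES_READ=1556534680 (about 1.45 GB). Assuming the default 64 MB HDFS block size of that Hadoop generation (your cluster may be configured differently), that works out to ceil(1556534680 / 67108864) = 24 splits, which matches the 24 launched map tasks.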
Also, remove the spaces around the = sign in the -D options (i.e. write -D mapred.reduce.tasks=0); with the spaces the value is never parsed, which is why the reduce setting is ignored.
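The -D generic options also have to come before your program's own arguments, and they are only honored if the driver is run through ToolRunner/GenericOptionsParser. Assuming Test_Parallel_for does that, the command would look like this (otherwise the -D flags are simply passed to your main method as ordinary arguments):
hadoop jar Test_Parallel_for.jar Test_Parallel_for -D mapred.map.tasks=20 -D mapred.reduce.tasks=0 Matrix/test4.txt Result 3
Even then, mapred.map.tasks remains only a hint; mapred.reduce.tasks, however, is honored exactly.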
For more information on the number of map and reduce tasks, please look at the URL below:
https://cwiki.apache.org/confluence/display/HADOOP2/HowManyMapsAndReduces
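If you cannot fix the command line, you can also force a map-only job from the driver itself. Below is a minimal, hypothetical sketch (class name MapOnlyDriver is made up; your Test_Parallel_for code may look quite different) showing a driver that goes through ToolRunner, so -D options are parsed into the Configuration, and that sets the number of reducers to zero explicitly:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapOnlyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // getConf() already contains any -D key=value pairs parsed by ToolRunner
        Job job = new Job(getConf(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);

        // Force a map-only job: no reduce tasks are launched at all
        job.setNumReduceTasks(0);

        // With the default (identity) mapper and TextInputFormat,
        // map output is LongWritable offset / Text line
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic options (-D, -files, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new MapOnlyDriver(), args));
    }
}

With setNumReduceTasks(0), map output is written straight to HDFS and the "Launched reduce tasks" counter should disappear from the job output.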