Each time I run my Hadoop program I need to change the number of mappers and reducers. Is there any way to pass the number of mappers and reducers to my program from the command line (when I run the program) and then use args to retrieve them?
It is important to understand that you cannot really specify the number of map tasks directly. Ultimately, the number of map tasks is defined by the number of input splits, which depends on your InputFormat implementation. Say you have 1 TB of input data and your HDFS block size is 64 MB: Hadoop will compute around 16k map tasks (1 TB / 64 MB = 16,384). From there, if you manually specify a value lower than 16k it will be ignored, but a value higher than 16k will be used.
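As a back-of-the-envelope check, here is a plain-Java sketch (not Hadoop API) of that arithmetic, using the figures from the example above:

long totalInputBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of input
long blockSize = 64L * 1024 * 1024;                // 64 MB HDFS block size
// One split (and hence one map task) per block, rounded up:
long estimatedMapTasks = (totalInputBytes + blockSize - 1) / blockSize;
System.out.println(estimatedMapTasks); // prints 16384, i.e. ~16k map tasks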
To pass the values via the command line, the easiest way is to use the built-in GenericOptionsParser class (described here), which directly parses common Hadoop-related command-line arguments like the ones you are trying to pass. The good thing is that it lets you pass pretty much any Hadoop parameter you want without having to write extra code later. You would do something like this:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Generic Hadoop options (e.g. -D key=value) are parsed into conf.
    String[] extraArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // do something with your non-Hadoop parameters if needed
}
Now the properties you need to set to modify the number of mappers and reducers are mapred.map.tasks and mapred.reduce.tasks respectively, so you can just run your job with these parameters:
-D mapred.map.tasks=42 -D mapred.reduce.tasks=42
They will be parsed directly by your GenericOptionsParser and will populate your Configuration object automatically. Note the space between -D and the property name: it matters, because without the space the option is interpreted as a JVM system property rather than a Hadoop parameter.
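Once GenericOptionsParser has run, the values are readable straight from the Configuration object, for example (the fallback defaults of 2 and 1 here are arbitrary):

int numMaps = conf.getInt("mapred.map.tasks", 2);       // falls back to 2 if unset
int numReduces = conf.getInt("mapred.reduce.tasks", 1); // falls back to 1 if unset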
You can specify the number of mappers and reducers (and really any parameter you can set in the config) by using the -D parameter. This works for all default Hadoop jars and for your own jars, as long as your job class extends Configured.
hadoop jar myJar.jar -Dmapreduce.job.maps=<Number of maps> -Dmapreduce.job.reduces=<Number of reducers>
From there you can retrieve the values using:
configuration.get("mapreduce.job.maps");
configuration.get("mapreduce.job.reduces");
or, for the reducers:
job.getNumReduceTasks();
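For completeness, a minimal driver wired up this way might look like the sketch below. MyJob is a placeholder name, and the mapper/reducer and input/output setup is omitted for brevity:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options, because ToolRunner
        // runs GenericOptionsParser before calling run().
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyJob.class);
        System.out.println("reduce tasks: " + job.getNumReduceTasks());
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJob(), args));
    }
}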
Specifying the number of mappers through these configuration values will not work when mapreduce.jobtracker.address is "local". See Charles' answer, where he explains how Hadoop usually determines the number of mappers from the data size.