How to pass an argument to the main program in Hadoop

Each time I run my Hadoop program I need to change the number of mappers and reducers. Is there any way to pass the number of mappers and reducers to my program from the command line (when I run the program) and then use args to retrieve them?

asked Feb 17 '23 by HHH

2 Answers

It is important to understand that you cannot really specify the number of map tasks directly. Ultimately, the number of map tasks is the number of input splits, which depends on your InputFormat implementation. Say you have 1 TB of input data and your HDFS block size is 64 MB: Hadoop will compute around 16,384 map tasks (1 TB / 64 MB). From there, if you specify a manual value lower than that, it will be ignored, but a value higher than that will be used.
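To make the split computation concrete, here is a simplified sketch of the logic FileInputFormat uses to pick a split size (illustrative only, not the exact Hadoop source):

// Simplified: the HDFS block size wins unless the configured
// minimum/maximum split sizes clamp it.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}

// With default settings splitSize == blockSize, so:
// 1 TB / 64 MB per split ≈ 16,384 splits, i.e. ~16k map tasks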

To pass them via the command line, the easiest way is to use the built-in class GenericOptionsParser, which directly parses common Hadoop command-line arguments like the ones you are trying to set. The good thing is that it lets you pass pretty much any Hadoop parameter you want without having to write extra parsing code. You would do something like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // GenericOptionsParser consumes the generic Hadoop options (-D, -files, ...)
    String[] extraArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // do something with your non-Hadoop parameters if needed
}

Now, the properties that control the number of mappers and reducers are mapred.map.tasks and mapred.reduce.tasks respectively, so you can just run your job with these parameters:

-D mapred.map.tasks=42 -D mapred.reduce.tasks=42

and they will be parsed directly by your GenericOptionsParser and populate your Configuration object automatically. Note that there is a space between -D and the property: this is important, because otherwise the option is interpreted as a JVM system property instead of a Hadoop one.
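For context, GenericOptionsParser is exactly what ToolRunner runs under the hood, so a driver that extends Configured and implements Tool gets this parsing for free. A minimal sketch (the class name MyJobDriver and the job wiring are illustrative, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already holds any -D properties parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "my job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper/reducer classes and input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}

You would then launch it as, for example: hadoop jar myJar.jar MyJobDriver -D mapred.map.tasks=42 -D mapred.reduce.tasks=42 <input> <output>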


answered Feb 23 '23 by Charles Menguy


You can specify the number of mappers and reducers (and really any parameter you can set in the config) by using the -D parameter. This works for all the default Hadoop jars and for your own jars, as long as your job class extends Configured.

hadoop jar myJar.jar -Dmapreduce.job.maps=<Number of maps> -Dmapreduce.job.reduces=<Number of reducers>

From there you can retrieve the values using:

configuration.get("mapreduce.job.maps");
configuration.get("mapreduce.job.reduces");

or, specifically for reducers:

job.getNumReduceTasks();
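If you prefer to fix the reducer count in code rather than on the command line, you can also set it programmatically; a minimal sketch (assumes a Job object named job):

// Calling this after the Job is created overrides any -D value,
// because Job copies the Configuration at construction time.
job.setNumReduceTasks(42);

Note that the new-API Job class has no setter that forces the number of mappers; as the other answer explains, that remains a hint derived from the input splits.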

Specifying the number of mappers with these configuration values will not work when mapreduce.jobtracker.address is "local", i.e. in local mode. See Charles' answer above, where he explains how Hadoop normally derives the number of mappers from the input size.

answered Feb 23 '23 by greedybuddha