Difference in calling the job

Question

What is the difference between calling a MapReduce job from main() and from ToolRunner.run()? When the main class (say, MapReduce) extends Configured and implements Tool, what additional privileges do we get that we would not have if we simply ran the job from the main method? Thanks.

Chris White · Accepted Answer

There are no extra privileges, but your command-line options are run through the GenericOptionsParser, which lets you extract certain configuration properties and configure a Configuration object from them:

http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html

Basically, rather than parsing options yourself (using the index of each argument in the list), you can set Configuration properties explicitly from the command line. The hand-parsed version:

hadoop jar myJar.jar com.Main prop1value prop2value

public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("prop1", args[0]); // positional: first argument
    conf.set("prop2", args[1]); // positional: second argument

    conf.get("prop1"); // resolves to "prop1value"
    conf.get("prop2"); // resolves to "prop2value"
}

This becomes much more condensed with ToolRunner:

hadoop jar myJar.jar com.Main -Dprop1=prop1value -Dprop2=prop2value

public int run(String[] args) {
    // getConf() already holds the -D properties parsed by ToolRunner
    Configuration conf = getConf();

    conf.get("prop1"); // resolves to "prop1value"
    conf.get("prop2"); // resolves to "prop2value"
    return 0;
}
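
To make the -D handling concrete, here is a simplified model of it in plain Java. This is a sketch only, not Hadoop's actual GenericOptionsParser code: the class name GenericOptionsSketch, the parse method, and the use of a Map in place of a Configuration are all made up for illustration. It assumes the two spellings -Dkey=value and -D key=value, and it returns whatever arguments are left over.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the -D handling; NOT Hadoop's implementation.
class GenericOptionsSketch {

    // Copies every -Dkey=value (or "-D key=value") pair into conf and
    // returns the arguments that were not consumed.
    static String[] parse(String[] args, Map<String, String> conf) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            int eq = (pair == null) ? -1 : pair.indexOf('=');
            if (eq > 0) {
                conf.put(pair.substring(0, eq), pair.substring(eq + 1));
            } else if (pair == null) {
                remaining.add(arg);        // ordinary argument, pass it through
            }
            // A -D option without "key=value" is silently dropped in this sketch.
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] rest = parse(new String[] {
            "-Dprop1=prop1value", "-D", "prop2=prop2value", "input", "output"
        }, conf);
        System.out.println(conf);                   // {prop1=prop1value, prop2=prop2value}
        System.out.println(String.join(" ", rest)); // input output
    }
}
```

In a real driver you would not write any of this yourself: main() typically just calls System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args)), where MyDriver is a placeholder for your own class that extends Configured and implements Tool, and run() receives only the leftover arguments.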

One final word of warning, though: when using getConf(), create your Job object first, then pull its Configuration out. The Job constructor makes a copy of the Configuration object passed in, so if you make changes to the reference you passed in, your job will not see those changes:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    conf.set("prop3", "blah"); // set before the Job is created: the job sees it

    Job job = new Job(conf); // job will have a deep copy of conf

    conf.set("prop4", "dummy"); // here we're amending the original conf, not the job's copy

    job.getConfiguration().get("prop4"); // resolves to null
    return 0;
}
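
The copy behavior can be seen in isolation with a toy model in plain Java. This is only a sketch under stated assumptions: ToyJob and the HashMap are hypothetical stand-ins for Job and Configuration, and no Hadoop classes are involved. It shows that a value set on the original after construction is invisible to the job, while a value set through the job's own configuration is not.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the copy-on-construct behavior described above; not Hadoop code.
class JobCopySketch {

    // Hypothetical stand-in for Job: snapshots the settings it is given.
    static final class ToyJob {
        private final Map<String, String> conf;

        ToyJob(Map<String, String> original) {
            this.conf = new HashMap<>(original); // copy, so later edits to 'original' are not seen
        }

        Map<String, String> getConfiguration() {
            return conf;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("prop3", "blah");                   // set before construction

        ToyJob job = new ToyJob(conf);

        conf.put("prop4", "dummy");                  // too late: changes the original only
        job.getConfiguration().put("prop5", "seen"); // changes the job's own copy

        System.out.println(job.getConfiguration().get("prop3")); // blah
        System.out.println(job.getConfiguration().get("prop4")); // null
        System.out.println(job.getConfiguration().get("prop5")); // seen
    }
}
```

The same ordering applies in a real driver: once the Job exists, set any further properties through job.getConfiguration() rather than through the object returned by getConf().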

Tejas Patil · Answer

By using ToolRunner.run(), any Hadoop application can handle the standard command-line options supported by Hadoop. ToolRunner uses GenericOptionsParser internally. In short, the Hadoop-specific options provided on the command line are parsed and set into the Configuration object of the application. If you simply use main(), this won't happen automatically.

For example, if you run:

% hadoop MyHadoopApp -D mapred.reduce.tasks=3

Then ToolRunner.run(new MyHadoopApp(), args) will automatically set the parameter mapred.reduce.tasks to 3 in the Configuration object.

There are no additional privileges. People typically don't use a bare main() in Hadoop jobs; using ToolRunner.run() is standard practice.

Tags: java, hadoop, mapreduce

Ravi Trivedi
