Calling a MapReduce job from a simple Java program


I have been trying to call a MapReduce job from a simple Java program in the same package. I tried to reference the MapReduce jar file in my Java program and call it using the RunJar.main(String args[]) method, passing the input and output paths for the MapReduce job as well. But the program didn't work.

How do I run such a program where I just pass the input, output and jar paths to its main method? Is it possible to run a MapReduce job (jar) through it? I want to do this because I want to run several MapReduce jobs one after another, where my Java program will call each job by referring to its jar file. If this is possible, I might as well use a simple servlet to do the calling and refer to its output files for graphing purposes.

    /*
     * To change this template, choose Tools | Templates
     * and open the template in the editor.
     */

    /**
     *
     * @author root
     */
    import org.apache.hadoop.util.RunJar;
    import java.util.*;

    public class callOther {

        public static void main(String args[]) throws Throwable {
            ArrayList arg = new ArrayList();
            String output = "/root/Desktp/output";
            arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");
            arg.add("/root/Desktop/input");
            arg.add(output);
            RunJar.main((String[]) arg.toArray(new String[0]));
        }
    }

2 Answers

Oh, please don't do it with RunJar; the Java API is very good.

See how you can start a job from normal code:

    // create a configuration
    Configuration conf = new Configuration();
    // create a new job based on the configuration
    Job job = new Job(conf);
    // here you have to put your mapper class
    job.setMapperClass(Mapper.class);
    // here you have to put your reducer class
    job.setReducerClass(Reducer.class);
    // here you have to set the jar which is containing your
    // map/reduce class, so you can use the mapper class
    job.setJarByClass(Mapper.class);
    // key/value of your reducer output
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // this is setting the format of your input, can be TextInputFormat
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // same with output
    job.setOutputFormatClass(TextOutputFormat.class);
    // here you can set the path of your input
    SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
    // this deletes possible output paths to prevent job failures
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("files/out/processed/");
    fs.delete(out, true);
    // finally set the empty out path
    TextOutputFormat.setOutputPath(job, out);

    // this waits until the job completes and prints debug out to STDOUT or whatever
    // has been configured in your log4j properties.
    job.waitForCompletion(true);
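Since the question asks about running several jobs one after another: `waitForCompletion` returns a boolean that tells you whether the job succeeded, so a driver can run jobs in order and stop at the first failure. Here is a minimal sketch of that control flow only. `JobChain` and `runInOrder` are hypothetical names, not part of Hadoop, and each job is stood in for by a `Callable<Boolean>`; in real code a step would be something like `() -> job.waitForCompletion(true)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop: runs steps in order and
// stops at the first one that reports failure.
class JobChain {

    /**
     * Runs each step in order and returns how many steps succeeded
     * before the chain stopped. An exception counts as a failure.
     */
    static int runInOrder(List<Callable<Boolean>> steps) {
        int succeeded = 0;
        for (Callable<Boolean> step : steps) {
            boolean ok;
            try {
                // With Hadoop this would be job.waitForCompletion(true),
                // which blocks until the job ends.
                ok = step.call();
            } catch (Exception e) {
                System.err.println("step failed with: " + e);
                ok = false;
            }
            if (!ok) {
                break; // later jobs usually need this job's output, so stop
            }
            succeeded++;
        }
        return succeeded;
    }

    public static void main(String[] args) {
        List<Callable<Boolean>> steps = new ArrayList<>();
        steps.add(() -> true);   // stand-in for a job that succeeds
        steps.add(() -> false);  // stand-in for a job that fails
        steps.add(() -> true);   // never reached
        System.out.println(runInOrder(steps) + " of " + steps.size() + " steps succeeded");
    }
}
```

Each real job would get its own `Job` object, with the output path of one job used as the input path of the next.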

If you are using an external cluster, you have to add the following settings to your configuration:

    // this should be like defined in your mapred-site.xml
    conf.set("mapred.job.tracker", "jobtracker.com:50001");
    // like defined in hdfs-site.xml
    conf.set("fs.default.name", "hdfs://namenode.com:9000");

This should be no problem when the hadoop-core.jar is in your application container's classpath. But I think you should put some kind of progress indicator on your web page, because a Hadoop job may take minutes to hours to complete ;)
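One way to make a progress indicator possible is to start the job on a background thread and let the page poll for its state instead of blocking the request. The following is a minimal sketch under that assumption. `JobRunner` and its methods are hypothetical names, not part of Hadoop or the Servlet API, and the submitted `Callable<Boolean>` stands in for a call such as `job.waitForCompletion(true)`.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs long jobs in the background and records their state.
class JobRunner {
    enum State { RUNNING, SUCCEEDED, FAILED }

    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final Map<Integer, State> states = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    /** Starts the work in the background and returns an id to poll with. */
    int submit(Callable<Boolean> work) {
        int id = nextId.incrementAndGet();
        states.put(id, State.RUNNING);
        pool.submit(() -> {
            try {
                states.put(id, work.call() ? State.SUCCEEDED : State.FAILED);
            } catch (Exception e) {
                states.put(id, State.FAILED);
            }
        });
        return id;
    }

    /** Returns the current state, or null for an unknown id. */
    State stateOf(int id) {
        return states.get(id);
    }

    /** Stops accepting work and waits up to the given time for running work. */
    boolean shutdownAndWait(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        JobRunner runner = new JobRunner();
        int id = runner.submit(() -> true); // stand-in for a real job
        runner.shutdownAndWait(10);
        System.out.println("job " + id + ": " + runner.stateOf(id));
    }
}
```

In a web application, the request that triggers the job would return the id right away, and a second request would call something like `stateOf(id)` to refresh the page.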

For YARN (Hadoop 2 and later)

For YARN, the following configurations need to be set.

    // this should be like defined in your yarn-site.xml
    conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
    // framework is now "yarn", should be defined like this in mapred-site.xml
    conf.set("mapreduce.framework.name", "yarn");
    // like defined in hdfs-site.xml
    conf.set("fs.default.name", "hdfs://namenode.com:9000");
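If the same driver has to target both kinds of cluster, the two sets of properties can be collected in one place and then copied into the `Configuration` with `conf.set(key, value)`. This is only a sketch: `ClusterSettings` and `forCluster` are hypothetical names, and the property keys are the ones used in this answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: picks the connection properties for a classic
// (JobTracker) cluster or a YARN cluster.
class ClusterSettings {

    /**
     * @param yarn             true for a YARN cluster, false for a JobTracker cluster
     * @param schedulerAddress host:port of the ResourceManager or JobTracker
     * @param hdfsUri          URI of the NameNode
     */
    static Map<String, String> forCluster(boolean yarn, String schedulerAddress, String hdfsUri) {
        Map<String, String> props = new LinkedHashMap<>();
        if (yarn) {
            props.put("yarn.resourcemanager.address", schedulerAddress);
            props.put("mapreduce.framework.name", "yarn");
        } else {
            props.put("mapred.job.tracker", schedulerAddress);
        }
        props.put("fs.default.name", hdfsUri);
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props =
                forCluster(true, "yarn-manager.com:50001", "hdfs://namenode.com:9000");
        // With Hadoop on the classpath, each entry would go through conf.set(key, value).
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```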
Calling a MapReduce job from a Java web application (Servlet)

You can call a MapReduce job from a web application using the Java API. Here is a small example of calling a MapReduce job from a servlet. The steps are given below:

Step 1: First, create a MapReduce driver servlet class, and develop the map and reduce classes. Here is a sample code snippet:


    public class CallJobFromServlet extends HttpServlet {

        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {

            Configuration conf = new Configuration();
            Job job = new Job(conf, "CallJobFromServlet");
            // Replace CallJobFromServlet.class with your servlet class
            job.setJarByClass(CallJobFromServlet.class);
            job.setJobName("Job Name");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Map.class); // Replace Map.class with your Mapper class
            job.setNumReduceTasks(30);
            job.setReducerClass(Reducer.class); // Replace Reducer.class with your Reducer class
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // Job input path
            FileInputFormat.addInputPath(job,
                    new Path("hdfs://localhost:54310/user/hduser/input/"));

            // Job output path
            FileOutputFormat.setOutputPath(job,
                    new Path("hdfs://localhost:54310/user/hduser/output"));

            // waitForCompletion declares checked exceptions that doPost may not throw,
            // so wrap them in a ServletException
            try {
                job.waitForCompletion(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new ServletException(e);
            } catch (ClassNotFoundException e) {
                throw new ServletException(e);
            }
        }
    }
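Note that a job with a file-based output format fails at submission if its output directory already exists, so a servlet with a fixed output path only works for the first request unless the directory is deleted first (as the other answer does). An alternative is to give every run its own output directory. This is a minimal sketch with hypothetical names (`OutputPaths`, `forRun`); the resulting string would be passed to `new Path(...)`.

```java
// Hypothetical helper: derives a separate output directory for each run.
class OutputPaths {

    /**
     * Appends a run-specific directory name to the base directory.
     * runMillis is typically System.currentTimeMillis(); two runs started
     * in the same millisecond would still collide.
     */
    static String forRun(String baseDir, long runMillis) {
        // Drop a trailing slash so the result never contains "//"
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/run-" + runMillis;
    }

    public static void main(String[] args) {
        String out = forRun("hdfs://localhost:54310/user/hduser/output",
                System.currentTimeMillis());
        System.out.println(out);
    }
}
```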

Step 2: Place all the related JAR files (Hadoop and application-specific JARs) inside the lib folder of the web server (e.g. Tomcat). This is mandatory for accessing the Hadoop configuration (the Hadoop ‘conf’ folder holds the configuration XML files, i.e. core-site.xml, hdfs-site.xml, etc.). Just copy the JARs from the Hadoop lib folder to the web server's (Tomcat's) lib directory. The JAR names are as follows:

1.  commons-beanutils-1.7.0.jar
2.  commons-beanutils-core-1.8.0.jar
3.  commons-cli-1.2.jar
4.  commons-collections-3.2.1.jar
5.  commons-configuration-1.6.jar
6.  commons-httpclient-3.0.1.jar
7.  commons-io-2.1.jar
8.  commons-lang-2.4.jar
9.  commons-logging-1.1.1.jar
10. hadoop-client-1.0.4.jar
11. hadoop-core-1.0.4.jar
12. jackson-core-asl-1.8.8.jar
13. jackson-mapper-asl-1.8.8.jar
14. jersey-core-1.8.jar
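A quick way to confirm that the JARs were picked up is to check at startup whether the Hadoop classes can be found. This is a minimal sketch; `ClasspathCheck` is a hypothetical helper, and the two class names it looks for are the `Configuration` and `Job` classes used in Step 1.

```java
// Hypothetical helper: reports whether a class can be found on the classpath.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.mapreduce.Job"
        };
        for (String name : required) {
            System.out.println(name + (isOnClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```

If a class is reported missing, the corresponding JAR is not visible to the class loader that loaded the web application.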

Step 3: Deploy your web application to the web server (in the ’webapps’ folder for Tomcat).

Step 4: Create a JSP file and point the form's action attribute at the servlet (CallJobFromServlet.java). Because the servlet implements only doPost, the form has to submit with method="post". Here is a sample code snippet:

    <form id="trigger_hadoop" name="trigger_hadoop" method="post" action="./CallJobFromServlet">
        <span class="back">Trigger Hadoop Job from Web Page</span>
        <input type="submit" name="submit" value="Trigger Job" />
    </form>
