Chaining multiple MapReduce jobs in Hadoop

People also ask

What is the preferred file format when chaining multiple MapReduce jobs?

MapReduce is a computation abstraction that works well with The Hadoop Distributed File System (HDFS).

Can you provide multiple input paths to a MapReduce jobs?

We use MultipleInputs class which supports MapReduce jobs that have multiple input paths with a different InputFormat and Mapper for each path.

Can we have multiple reducers in MapReduce?

If there are lot of key-values to merge, a single reducer might take too much time. To avoid reducer machine becoming the bottleneck, we use multiple reducers. When you have multiple reducers, each node that is running mapper puts key-values in multiple buckets just after sorting.

Is there any limit to number of multistep MapReduce jobs?

You can submit as many jobs you want, they will be queued up and scheduler will run them based on FIFO(by default) and available resources.

I think this tutorial on Yahoo's developer network will help you with this: Chaining Jobs

You use the JobClient.runJob(). The output path of the data from the first job becomes the input path to your second job. These need to be passed in as arguments to your jobs with appropriate code to parse them and set up the parameters for the job.

I think that the above method might however be the way the now older mapred API did it, but it should still work. There will be a similar method in the new mapreduce API but i'm not sure what it is.

As far as removing intermediate data after a job has finished you can do this in your code. The way i've done it before is using something like:

FileSystem.delete(Path f, boolean recursive);

Where the path is the location on HDFS of the data. You need to make sure that you only delete this data once no other job requires it.

There are many ways you can do it.

(1) Cascading jobs

Create the JobConf object "job1" for the first job and set all the parameters with "input" as inputdirectory and "temp" as output directory. Execute this job:

JobClient.run(job1).

Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as inputdirectory and "output" as output directory. Execute this job:

JobClient.run(job2).

(2) Create two JobConf objects and set all the parameters in them just like (1) except that you don't use JobClient.run.

Then create two Job objects with jobconfs as parameters:

Job job1=new Job(jobconf1); 
Job job2=new Job(jobconf2);

Using the jobControl object, you specify the job dependencies and then run the jobs:

JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();

(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.

There are actually a number of ways to do this. I'll focus on two.

One is via Riffle ( http://github.com/cwensel/riffle ) an annotation library for identifying dependent things and 'executing' them in dependency (topological) order.

Or you can use a Cascade (and MapReduceFlow) in Cascading ( http://www.cascading.org/ ). A future version will support Riffle annotations, but it works great now with raw MR JobConf jobs.

A variant on this is to not manage MR jobs by hand at all, but develop your application using the Cascading API. Then the JobConf and job chaining is handled internally via the Cascading planner and Flow classes.

This way you spend your time focusing on your problem, not on the mechanics of managing Hadoop jobs etc. You can even layer different languages on top (like clojure or jruby) to even further simplify your development and applications. http://www.cascading.org/modules.html

You may run MR chain in the manner as given in the code.

PLEASE NOTE: Only the driver code has been provided

public class WordCountSorting {
// here the word keys shall be sorted
      //let us write the wordcount logic first

      public static void main(String[] args)throws IOException,InterruptedException,ClassNotFoundException {
            //THE DRIVER CODE FOR MR CHAIN
            Configuration conf1=new Configuration();
            Job j1=Job.getInstance(conf1);
            j1.setJarByClass(WordCountSorting.class);
            j1.setMapperClass(MyMapper.class);
            j1.setReducerClass(MyReducer.class);

            j1.setMapOutputKeyClass(Text.class);
            j1.setMapOutputValueClass(IntWritable.class);
            j1.setOutputKeyClass(LongWritable.class);
            j1.setOutputValueClass(Text.class);
            Path outputPath=new Path("FirstMapper");
            FileInputFormat.addInputPath(j1,new Path(args[0]));
                  FileOutputFormat.setOutputPath(j1,outputPath);
                  outputPath.getFileSystem(conf1).delete(outputPath);
            j1.waitForCompletion(true);
                  Configuration conf2=new Configuration();
                  Job j2=Job.getInstance(conf2);
                  j2.setJarByClass(WordCountSorting.class);
                  j2.setMapperClass(MyMapper2.class);
                  j2.setNumReduceTasks(0);
                  j2.setOutputKeyClass(Text.class);
                  j2.setOutputValueClass(IntWritable.class);
                  Path outputPath1=new Path(args[1]);
                  FileInputFormat.addInputPath(j2, outputPath);
                  FileOutputFormat.setOutputPath(j2, outputPath1);
                  outputPath1.getFileSystem(conf2).delete(outputPath1, true);
                  System.exit(j2.waitForCompletion(true)?0:1);
      }

}

THE SEQUENCE IS

(JOB1)MAP->REDUCE-> (JOB2)MAP
This was done to get the keys sorted yet there are more ways such as using a treemap
Yet I want to focus your attention onto the way the Jobs have been chained!!
Thank you

I have done job chaining using with JobConf objects one after the other. I took WordCount example for chaining the jobs. One job figures out how many times a word a repeated in the given output. Second job takes first job output as input and figures out total words in the given input. Below is the code that need to be placed in Driver class.

    //First Job - Counts, how many times a word encountered in a given file 
    JobConf job1 = new JobConf(WordCount.class);
    job1.setJobName("WordCount");

    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);

    job1.setMapperClass(WordCountMapper.class);
    job1.setCombinerClass(WordCountReducer.class);
    job1.setReducerClass(WordCountReducer.class);

    job1.setInputFormat(TextInputFormat.class);
    job1.setOutputFormat(TextOutputFormat.class);

    //Ensure that a folder with the "input_data" exists on HDFS and contains the input files
    FileInputFormat.setInputPaths(job1, new Path("input_data"));

    //"first_job_output" contains data that how many times a word occurred in the given file
    //This will be the input to the second job. For second job, input data name should be
    //"first_job_output". 
    FileOutputFormat.setOutputPath(job1, new Path("first_job_output"));

    JobClient.runJob(job1);


    //Second Job - Counts total number of words in a given file

    JobConf job2 = new JobConf(TotalWords.class);
    job2.setJobName("TotalWords");

    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(IntWritable.class);

    job2.setMapperClass(TotalWordsMapper.class);
    job2.setCombinerClass(TotalWordsReducer.class);
    job2.setReducerClass(TotalWordsReducer.class);

    job2.setInputFormat(TextInputFormat.class);
    job2.setOutputFormat(TextOutputFormat.class);

    //Path name for this job should match first job's output path name
    FileInputFormat.setInputPaths(job2, new Path("first_job_output"));

    //This will contain the final output. If you want to send this jobs output
    //as input to third job, then third jobs input path name should be "second_job_output"
    //In this way, jobs can be chained, sending output one to other as input and get the
    //final output
    FileOutputFormat.setOutputPath(job2, new Path("second_job_output"));

    JobClient.runJob(job2);

Command to run these jobs is:

bin/hadoop jar TotalWords.

We need to give final jobs name for the command. In the above case, it is TotalWords.

You can use oozie for barch processing your MapReduce jobs. http://issues.apache.org/jira/browse/HADOOP-5303

Related questions
                            
                                Hadoop truncated/inconsistent counter name
                            
                                How to check if ZooKeeper is running or up from command prompt?
                            
                                When do reduce tasks start in Hadoop?
                            
                                How do I output the results of a HiveQL query to CSV?
                            
                                Large scale data processing Hbase vs Cassandra [closed]
                            
                                Container is running beyond memory limits
                            
                                Parquet vs ORC vs ORC with Snappy
                            
                                What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
                            
                                How to know Hive and Hadoop versions from command prompt?
                            
                                Is there a .NET equivalent to Apache Hadoop? [closed]
                            
                                Avro vs. Parquet
                            
                                hadoop No FileSystem for scheme: file
                            
                                Can apache spark run without hadoop?
                            
                                The way to check a HDFS directory's size?
                            
                                connect to host localhost port 22: Connection refused
                            
                                How does the MapReduce sort algorithm work?
                            
                                Difference between Hive internal tables and external tables?
                            
                                what's the difference between "hadoop fs" shell commands and "hdfs dfs" shell commands?
                            
                                Failed to locate the winutils binary in the hadoop binary path
                            
                                How does Hadoop process records split across block boundaries?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Chaining multiple MapReduce jobs in Hadoop

Tags:

hadoop

mapreduce

People also ask

Recent Activity

Donate For Us