Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce?

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class would work on a different set of inputs, but they would all emit key-value pairs consumed by the same reducer. Note that I'm not talking about chaining mappers here; I'm talking about running different mappers in parallel, not sequentially.

asked Jun 16 '12 by tibbe

2 Answers

This is called a join.

You want to use the mappers and reducers from the mapred.* packages (the older API, but still supported). The newer mapreduce.* packages only allow for one mapper input. With the mapred packages, you use the MultipleInputs class to define the join:

// Each input path gets its own InputFormat and Mapper (old mapred API)
MultipleInputs.addInputPath(jobConf,
        new Path(countsSource),
        SequenceFileInputFormat.class,
        CountMapper.class);
MultipleInputs.addInputPath(jobConf,
        new Path(dictionarySource),
        SomeOtherInputFormat.class,
        TranslateMapper.class);

jobConf.setJarByClass(ReportJob.class);
jobConf.setReducerClass(WriteTextReducer.class);

// Both mappers must emit the same key/value types for the shared reducer
jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(WordInfo.class);

jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputValueClass(Text.class);
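Under the hood, both mappers' outputs are shuffled together by key, so the single reducer sees values from both inputs grouped under each key. A minimal plain-Java sketch of that reduce-side join idea (no Hadoop dependencies; all class and method names here are illustrative, not from the actual job above):

```java
import java.util.*;

// Simulates the shuffle/reduce phase of a reduce-side join: two "mappers"
// emit (key, tagged-value) pairs from heterogeneous inputs, and a single
// "reducer" sees all values for a key grouped together.
public class JoinSketch {

    // Mapper A: word counts, e.g. "apple\t3" -> ("apple", "COUNT:3")
    static Map.Entry<String, String> countMapper(String line) {
        String[] parts = line.split("\t");
        return Map.entry(parts[0], "COUNT:" + parts[1]);
    }

    // Mapper B: dictionary entries, e.g. "apple\tpomme" -> ("apple", "TRANS:pomme")
    static Map.Entry<String, String> translateMapper(String line) {
        String[] parts = line.split("\t");
        return Map.entry(parts[0], "TRANS:" + parts[1]);
    }

    // Reducer: joins the tagged values that were grouped under one key.
    static String reduce(String key, List<String> values) {
        String count = "", trans = "";
        for (String v : values) {
            if (v.startsWith("COUNT:")) count = v.substring(6);
            if (v.startsWith("TRANS:")) trans = v.substring(6);
        }
        return key + "\t" + trans + "\t" + count;
    }

    static List<String> run(List<String> counts, List<String> dict) {
        // "Shuffle": group all mapper outputs by key, as Hadoop would.
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String line : counts) {
            Map.Entry<String, String> kv = countMapper(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (String line : dict) {
            Map.Entry<String, String> kv = translateMapper(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<String> out = new ArrayList<>();
        grouped.forEach((k, vs) -> out.add(reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> result = run(
                List.of("apple\t3", "pear\t5"),
                List.of("apple\tpomme", "pear\tpoire"));
        // One joined line per key, in sorted key order
        result.forEach(System.out::println);
    }
}
```

The tag prefixes (`COUNT:`, `TRANS:`) stand in for what the two distinct mapper classes would encode in their emitted value type so the reducer can tell the sources apart.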
answered Oct 05 '22 by Chris Gerken

I will answer your question with a question, 2 answers, and an anti-recommendation.

The question is: what benefit do you see in running the heterogeneous map jobs in parallel, as opposed to running them in series, outputting homogeneous results that can be properly shuffled? Is the idea to avoid passing over the same records twice, once with an identity map?

The first answer is to schedule both mapper-only jobs simultaneously, each on half your fleet (or whatever ratio best matches the input data size), outputting homogeneous results, followed by a reducer-only job that performs the join.

The second answer is to create a custom InputFormat that is able to recognize and transform both flavors of the heterogeneous input. This is extremely ugly, but it will allow you to avoid the unnecessary identity map of the first suggestion.

The anti-recommendation is not to use the deprecated Hadoop APIs from Chris's answer. Hadoop is still young, but its APIs are stabilizing around the "new" flavor, and you will end up version-locked eventually.
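As a side note on that anti-recommendation: later Hadoop releases added a MultipleInputs class to the new API as well, under org.apache.hadoop.mapreduce.lib.input. Assuming a Hadoop 2.x classpath and reusing the (illustrative) class names from Chris's example, the new-API wiring looks roughly like this untested fragment:

```java
// New-API (mapreduce.*) equivalent of the mapred.* example above.
// ReportJob, CountMapper, etc. are placeholders from Chris's answer.
Job job = Job.getInstance(conf, "report join");
job.setJarByClass(ReportJob.class);

MultipleInputs.addInputPath(job, new Path(countsSource),
        SequenceFileInputFormat.class, CountMapper.class);
MultipleInputs.addInputPath(job, new Path(dictionarySource),
        SomeOtherInputFormat.class, TranslateMapper.class);

job.setReducerClass(WriteTextReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(WordInfo.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
```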

answered Oct 05 '22 by Judge Mental