Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce?

Is it possible to have multiple inputs with multiple different mappers in Hadoop MapReduce? Each mapper class would work on a different set of inputs, but they would all emit key-value pairs consumed by the same reducer. Note that I'm not talking about chaining mappers here; I'm talking about running different mappers in parallel, not sequentially.

asked Jun 16 '12 by tibbe

2 Answers

This is called a join.

You want to use the mappers and reducers from the mapred.* packages (the older API, but still supported). The newer mapreduce.* packages only allow for one mapper input. With the mapred packages, you use the MultipleInputs class to define the join:

// Each input path gets its own InputFormat and Mapper (old mapred API)
MultipleInputs.addInputPath(jobConf,
        new Path(countsSource),
        SequenceFileInputFormat.class,
        CountMapper.class);
MultipleInputs.addInputPath(jobConf,
        new Path(dictionarySource),
        SomeOtherInputFormat.class,
        TranslateMapper.class);

jobConf.setJarByClass(ReportJob.class);
jobConf.setReducerClass(WriteTextReducer.class);

// Both mappers must emit the same key/value types for the shared reducer
jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(WordInfo.class);

jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputValueClass(Text.class);
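Under the hood, both mappers' outputs are shuffled together by key, so the single reducer sees values from both inputs grouped under each key. A minimal plain-Java sketch of that reduce-side join idea (no Hadoop dependencies; all class and method names here are illustrative, not from the actual job above):

```java
import java.util.*;

// Simulates the shuffle/reduce phase of a reduce-side join: two "mappers"
// emit (key, tagged-value) pairs from heterogeneous inputs, and a single
// "reducer" sees all values for a key grouped together.
public class JoinSketch {

    // Mapper A: word counts, e.g. "apple\t3" -> ("apple", "COUNT:3")
    static Map.Entry<String, String> countMapper(String line) {
        String[] parts = line.split("\t");
        return Map.entry(parts[0], "COUNT:" + parts[1]);
    }

    // Mapper B: dictionary entries, e.g. "apple\tpomme" -> ("apple", "TRANS:pomme")
    static Map.Entry<String, String> translateMapper(String line) {
        String[] parts = line.split("\t");
        return Map.entry(parts[0], "TRANS:" + parts[1]);
    }

    // Reducer: joins the tagged values that were grouped under one key.
    static String reduce(String key, List<String> values) {
        String count = "", trans = "";
        for (String v : values) {
            if (v.startsWith("COUNT:")) count = v.substring(6);
            if (v.startsWith("TRANS:")) trans = v.substring(6);
        }
        return key + "\t" + trans + "\t" + count;
    }

    static List<String> run(List<String> counts, List<String> dict) {
        // "Shuffle": group all mapper outputs by key, as Hadoop would.
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String line : counts) {
            Map.Entry<String, String> kv = countMapper(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (String line : dict) {
            Map.Entry<String, String> kv = translateMapper(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        List<String> out = new ArrayList<>();
        grouped.forEach((k, vs) -> out.add(reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> result = run(
                List.of("apple\t3", "pear\t5"),
                List.of("apple\tpomme", "pear\tpoire"));
        // One joined line per key, in sorted key order
        result.forEach(System.out::println);
    }
}
```

The tag prefixes (`COUNT:`, `TRANS:`) stand in for what the two distinct mapper classes would encode in their emitted value type so the reducer can tell the sources apart.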
answered Oct 05 '22 by Chris Gerken

I will answer your question with a question, 2 answers, and an anti-recommendation.

The question is: what benefit do you see in running the heterogeneous map jobs in parallel, as opposed to running them in series, outputting homogeneous results that can be properly shuffled? Is the idea to avoid passing over the same records twice, once with an identity map?

The first answer is to schedule both mapper-only jobs simultaneously, each on half your fleet (or whatever ratio best matches the input data size), outputting homogeneous results, followed by a reducer-only job that performs the join.

The second answer is to create a custom InputFormat that is able to recognize and transform both flavors of the heterogeneous input. This is extremely ugly, but it will allow you to avoid the unnecessary identity map of the first suggestion.

The anti-recommendation is not to use the deprecated Hadoop APIs from Chris's answer. Hadoop is still young, but its APIs are stabilizing around the "new" flavor, and you will end up version-locked eventually.
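As a side note on that anti-recommendation: later Hadoop releases added a MultipleInputs class to the new API as well, under org.apache.hadoop.mapreduce.lib.input. Assuming a Hadoop 2.x classpath and reusing the (illustrative) class names from Chris's example, the new-API wiring looks roughly like this untested fragment:

```java
// New-API (mapreduce.*) equivalent of the mapred.* example above.
// ReportJob, CountMapper, etc. are placeholders from Chris's answer.
Job job = Job.getInstance(conf, "report join");
job.setJarByClass(ReportJob.class);

MultipleInputs.addInputPath(job, new Path(countsSource),
        SequenceFileInputFormat.class, CountMapper.class);
MultipleInputs.addInputPath(job, new Path(dictionarySource),
        SomeOtherInputFormat.class, TranslateMapper.class);

job.setReducerClass(WriteTextReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(WordInfo.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
```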

answered Oct 05 '22 by Judge Mental