 

What is the correct way to use oozie to write to multiple output streams for a mapreduce job?

I'm using the new Hadoop API to write a sequence of map-reduce jobs. I plan to use Oozie to pipeline all of these together, but I can't seem to find a way to do multiple output streams from a map-reduce node in the workflow.

Normally, to write multiple outputs I would use code similar to the example in the MultipleOutputs javadoc, but Oozie gets all of its configuration from the workflow.xml file, so the named outputs cannot be configured in driver code the way they are in that example.

I've come across a thread discussing the use of multiple outputs in Oozie, but there was no solution presented beyond creating a Java task and adding it to the Oozie pipeline directly.

Is there a way to do this via a map-reduce node in the workflow.xml?

Edit:

Chris's solution did work, though I wish there was a better way. Here are the exact changes I made.

I added the following to the workflow.xml file:

<property>
    <name>mapreduce.multipleoutputs</name>
    <value>${output1} ${output2}</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>

I added the following to the job.properties file that is fed to oozie at startup:

output1=totals
output2=uniques

Then in the reducer I wrote to the named outputs totals and uniques.

asked Mar 21 '12 17:03 by coltfred


1 Answer

The addNamedOutput utility method for MultipleOutputs just sets configuration properties, so look at an instance of your job that has already run and extract the properties for MultipleOutputs (look in the job.xml, linked from the JobTracker page).

Alternatively, look through the source for MultipleOutputs and see what configuration properties are being set when you call this method.

Once you know the properties being set, add them to the configuration section of the map-reduce element in your Oozie workflow.
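As a rough sketch of where those properties end up, the fragment below shows a map-reduce action with one named output configured. The MultipleOutputs property names are the ones reported in the question's edit. The action name, the `end` and `fail` transition targets, and the `${jobTracker}` and `${nameNode}` parameters are placeholders assumed for illustration, and the two `new-api` switches are included on the assumption that the job uses the new Hadoop API, as the question states.

```xml
<!-- Hypothetical action name and transition targets; adjust to your workflow. -->
<action name="multi-output-mr">
    <map-reduce>
        <!-- Assumed to be defined in job.properties. -->
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Tell Oozie the mapper and reducer use the new Hadoop API. -->
            <property>
                <name>mapred.mapper.new-api</name>
                <value>true</value>
            </property>
            <property>
                <name>mapred.reducer.new-api</name>
                <value>true</value>
            </property>
            <!-- Space-separated list of named outputs. -->
            <property>
                <name>mapreduce.multipleoutputs</name>
                <value>${output1}</value>
            </property>
            <!-- Key class, value class, and output format for the named output. -->
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.${output1}.key</name>
                <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.${output1}.value</name>
                <value>org.apache.hadoop.io.LongWritable</value>
            </property>
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.${output1}.format</name>
                <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The usual mapper, reducer, input, and output properties are omitted here; they would sit in the same configuration element. Property names can differ between Hadoop versions, so it is worth checking them against the job.xml of a run that used addNamedOutput.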

answered Oct 23 '22 11:10 by Chris White