I'm using the new Hadoop API to write a sequence of map-reduce jobs. I plan to use Oozie to pipeline all of these together, but I can't seem to find a way to do multiple output streams from a map-reduce node in the workflow.
Normally, to write multiple outputs I would use code similar to that given in the MultipleOutputs javadoc, but Oozie gets all its configuration from the workflow.xml file, so the named outputs cannot be configured the way they are in the example.
I've come across a thread discussing the use of multiple outputs in Oozie, but there was no solution presented beyond creating a Java task and adding it to the Oozie pipeline directly.
Is there a way to do this via a map-reduce node in the workflow.xml?
Edit:
Chris's solution did work, though I wish there was a better way. Here are the exact changes I made.
I added the following to the workflow.xml file:
<property>
<name>mapreduce.multipleoutputs</name>
<value>${output1} ${output2}</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output1}.key</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output1}.value</name>
<value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output1}.format</name>
<value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output2}.key</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output2}.value</name>
<value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
<name>mapreduce.multipleoutputs.namedOutput.${output2}.format</name>
<value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>
I added the following to the job.properties file that is fed to Oozie at startup:
output1=totals
output2=uniques
Then in the reducer I wrote to the named outputs totals and uniques.
The addNamedOutput utility methods for MultipleOutputs just set configuration properties, so look at an instance of your job that has already run and extract the properties for MultipleOutputs (look in the job.xml, linked from the JobTracker page).
Alternatively, look through the source for MultipleOutputs and see what configuration properties are being set when you call this method.
Once you know the properties being set, add them to the configuration section of the map-reduce element in your Oozie workflow.
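If you have several named outputs, writing those property blocks by hand gets repetitive. Here is a rough sketch of a small helper that prints them. It uses only the JDK (no Hadoop dependency) and assumes the property names are exactly the ones shown in the question's edit; check them against the job.xml of your own Hadoop version before relying on it. The class and method names (NamedOutputProps, propertiesFor, toPropertyXml) are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Oozie.
class NamedOutputProps {
    // Property names assumed from the workflow.xml in the question's edit.
    static final String LIST_KEY = "mapreduce.multipleoutputs";
    static final String PREFIX = "mapreduce.multipleoutputs.namedOutput.";

    // Builds the properties for a set of named outputs that all share
    // the same key class, value class and output format.
    static Map<String, String> propertiesFor(List<String> names, String keyClass,
                                             String valueClass, String formatClass) {
        Map<String, String> props = new LinkedHashMap<>();
        // Space-separated list of all named outputs.
        props.put(LIST_KEY, String.join(" ", names));
        for (String name : names) {
            props.put(PREFIX + name + ".key", keyClass);
            props.put(PREFIX + name + ".value", valueClass);
            props.put(PREFIX + name + ".format", formatClass);
        }
        return props;
    }

    // Renders the properties as <property> elements for the
    // configuration section of a map-reduce action.
    static String toPropertyXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("<property>\n")
              .append("<name>").append(e.getKey()).append("</name>\n")
              .append("<value>").append(e.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = propertiesFor(
                Arrays.asList("totals", "uniques"),
                "org.apache.hadoop.io.Text",
                "org.apache.hadoop.io.LongWritable",
                "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat");
        System.out.print(toPropertyXml(props));
    }
}
```

Running main prints one property block for the list of named outputs plus three per output, which you can paste into the configuration section. It writes the literal names rather than ${output1}-style variables, so adapt it if you want to keep the names in job.properties.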