How do I control the output file names and content of a Hadoop Streaming job?

Question

Is there a way to control the output file names of a Hadoop Streaming job? Specifically, I would like the names and contents of my job's output files to be organized by the key the reducer outputs: each file would contain only the values for one key, and its name would be the key.

Update: Just found the answer. Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

I haven't seen any samples for this out there. Can anyone point me to a Hadoop Streaming sample that makes use of a custom output format Java class?

Eran Kampf · Accepted Answer

Using a Java class that derives from MultipleOutputFormat as the job's output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

When using Hadoop Streaming, since only one JAR is supported, you actually have to fork the streaming JAR and put your new output format classes in it for streaming jobs to be able to reference them.
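For anyone looking for a starting point, here is a rough sketch rather than a tested job. The naming rule lives in plain Java so it can be run without Hadoop on the classpath, and the trailing comment shows how it might be wired into a subclass of MultipleTextOutputFormat (the text-oriented subclass of MultipleOutputFormat). The names KeyFileNames, fileNameForKey and KeyNamedOutputFormat are made up for illustration, and replacing unusual characters with '_' is my own assumption about what makes a safe file name, not something Hadoop requires.

```java
// Sketch only: KeyFileNames and fileNameForKey are hypothetical names.
public class KeyFileNames {

    // Maps a reducer output key to a name usable as a single path component.
    // Characters outside [A-Za-z0-9._-] become '_' (an assumption, see above).
    // Note: two different keys can collapse to the same name and would then
    // share one output file.
    static String fileNameForKey(String key) {
        String cleaned = key.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        // Avoid an empty name and the special "." and ".." path components.
        if (cleaned.isEmpty() || cleaned.equals(".") || cleaned.equals("..")) {
            return "_invalid_key";
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(fileNameForKey("user/42"));
        System.out.println(fileNameForKey("en-US"));
    }
}

/*
 * Possible wiring into the old (org.apache.hadoop.mapred) API, assuming the
 * streaming reducer emits Text keys and Text values:
 *
 * public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
 *     @Override
 *     protected String generateFileNameForKeyValue(Text key, Text value, String name) {
 *         // 'name' is the default leaf name (e.g. part-00000); ignored here.
 *         return KeyFileNames.fileNameForKey(key.toString());
 *     }
 *
 *     @Override
 *     protected Text generateActualKey(Text key, Text value) {
 *         return null; // a null key makes each output line contain only the value
 *     }
 * }
 */
```

Once such a class is inside the JAR the job uses, its fully qualified name would be passed to the streaming job through the `-outputformat` option. Since all values for a given key go to the same reducer, naming the file after the key alone should not cause two reducers to write the same file, as long as the sanitizing step does not merge distinct keys.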

EDIT: As of Hadoop version 0.20.2, this class has been deprecated and you should now use MultipleOutputs: http://hadoop.apache.org/docs/mapreduce/current/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Tags: distributed-computing, hadoop, mapreduce
