Python Streaming: how to reduce to multiple outputs? (It's possible with Java, though)

Question

I read Hadoop in Action and found that in Java, the MultipleOutputFormat and MultipleOutputs classes let us reduce the data to multiple files, but I am not sure how to achieve the same thing using Python streaming.

For example:

                  / out1/part-0000
mapper -> reducer   
                  \ out2/part-0000

If anyone knows of, has heard of, or has done something similar, please let me know.

Erik Forsberg · Accepted Answer

Dumbo Feathers, a set of Java classes to use together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.

Basically, in your Python Dumbo M/R job, you output a key that is a tuple of two elements: the first element is the name of the directory to output to, and the second is the actual key. The output class you've selected then inspects the tuple to find which output directory to use, and uses MultipleOutputFormat to write to different subdirectories.
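As a rough illustration of the tuple-key idea, here is a minimal sketch written in the generator style that Dumbo reducers use. It deliberately does not import Dumbo, so it can be run locally. The routing rule (totals of 10 or more go to out1), the sample data, and the helper names `route_locally` and `run_example` are all made up for the example. `route_locally` only imitates what the Java output class would do on the cluster.

```python
from collections import defaultdict


def reducer(key, values):
    """Dumbo-style reducer: yields ((output_dir, key), value) pairs."""
    total = sum(values)
    # Hypothetical routing rule, chosen only for this example.
    out_dir = "out1" if total >= 10 else "out2"
    # The first tuple element names the output directory,
    # the second is the actual key.
    yield (out_dir, key), total


def route_locally(pairs):
    """Local stand-in for the Java output class: group records by directory."""
    outputs = defaultdict(list)
    for (out_dir, key), value in pairs:
        outputs[out_dir].append((key, value))
    return dict(outputs)


def run_example():
    # Made-up data, already grouped by key as the framework would do.
    grouped = {"apple": [4, 7], "pear": [1, 2]}
    pairs = []
    for key in sorted(grouped):
        pairs.extend(reducer(key, grouped[key]))
    return route_locally(pairs)
```

In a real job, the reducer would be handed to Dumbo and the grouping into subdirectories would happen in the selected output class, not in Python.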

With Dumbo, this is easy because typedbytes is used as the output format, but I think it should be doable with other output formats as well.
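For plain-text streaming without Dumbo, one possible approach (a sketch, not something confirmed by the answer above) is to prefix the directory name to the key in the reducer's text output. The `out_dir/key` encoding, the threshold, and the function names below are assumptions made for illustration. On the Java side you would still need an output format that splits the key and picks the file, for instance a subclass of the old-API MultipleTextOutputFormat passed to the streaming job via `-outputformat`.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into (key, int(value))."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reduce_stream(lines):
    """Sum values per key; input must be sorted by key, as streaming provides."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        # Hypothetical routing rule and key encoding, for illustration only.
        out_dir = "out1" if total >= 10 else "out2"
        yield f"{out_dir}/{key}\t{total}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for the reducer script; call main() at the bottom of the file."""
    for out_line in reduce_stream(stdin):
        stdout.write(out_line + "\n")
```

The Java output format would then strip the `out1/` or `out2/` prefix from each key and use it as the subdirectory name.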

Tags:

python

hadoop

mapreduce

hadoop-streaming

daydreamer
