
Python Streaming: how to reduce to multiple outputs? (It's possible with Java, though)

I read Hadoop in Action and found that in Java, the MultipleOutputFormat and MultipleOutputs classes let a reducer write its data to multiple files. What I am not sure about is how to achieve the same thing using Python streaming.

For example:

                  / out1/part-0000
mapper -> reducer   
                  \ out2/part-0000

If anyone knows how to do this, or has heard of or done something similar, please let me know.

daydreamer asked Nov 04 '22 13:11



1 Answer

Dumbo Feathers, a set of Java classes to use together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.

Basically, in your Python Dumbo M/R job, you output a key that is a tuple of two elements: the first element is the name of the directory to write to, and the second is the actual key. The output class you've selected then inspects the tuple to find which output directory to use, and uses MultipleOutputFormat to write to different subdirectories.
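To make the tuple-key idea concrete, here is a minimal sketch of what the Python side might look like. The mapper and reducer follow Dumbo's usual generator style, but the word-count logic, the routing rule, and the directory names out1/out2 are made up for illustration. The run_locally helper is a hypothetical stand-in that only simulates the shuffle and the directory split in plain Python; on a real cluster that split is done by the Java output class, not by this code.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Dumbo-style mapper: emit (word, 1) for every word in the input line.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    total = sum(values)
    # Hypothetical routing rule: repeated words go to out1, the rest to out2.
    directory = "out1" if total > 1 else "out2"
    # The emitted key is a 2-tuple: (output directory, actual key).
    yield (directory, key), total


def run_locally(lines):
    # Local simulation only (an assumption for illustration): sort the mapper
    # output by key, reduce each group, then bucket records by the first
    # element of the tuple key, as the Java output class would on Hadoop.
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    mapped.sort(key=itemgetter(0))
    outputs = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for (directory, real_key), value in reducer(key, (v for _, v in group)):
            outputs.setdefault(directory, []).append((real_key, value))
    return outputs
```

In an actual Dumbo job you would drop run_locally, hand mapper and reducer to dumbo.run, and select the Feathers output class when starting the job.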

With Dumbo, this is easy because it uses typedbytes as the output format, but I think it should be doable with other output formats as well.
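For plain text streaming without typedbytes, one possible approach (an untested assumption, not something Feathers provides) is to encode the directory into the text key itself, for example as "directory/key". A small custom Java output format built on MultipleOutputFormat would then have to split that prefix off again and pick the subdirectory. The sketch below shows only the Python reducer side; the "/" separator, the summing logic, and the routing rule are hypothetical.

```python
import sys
from itertools import groupby


def parse(lines):
    # Streaming input arrives as "key<TAB>value" lines, already sorted by key.
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)


def reduce_stream(lines):
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        # Hypothetical routing rule; a real job would use its own criteria.
        directory = "out1" if total > 1 else "out2"
        # Prefix the key with the target directory. A Java-side output
        # format (assumed, not shown) must strip this prefix again.
        yield "%s/%s\t%d" % (directory, key, total)


def main():
    # Entry point when this file is used as the streaming reducer script.
    for record in reduce_stream(sys.stdin):
        print(record)
```

To use this as a reducer script, call main() at the bottom of the file. Keys that may themselves contain "/" would need a different separator.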

Erik Forsberg answered Nov 09 '22 16:11
