I read Hadoop in Action and found that in Java, the MultipleOutputFormat and MultipleOutputs classes let us write the reduce output to multiple files. What I am not sure about is how to achieve the same thing using Python streaming.
For example:

                    / out1/part-0000
mapper -> reducer
                    \ out2/part-0000
If anyone knows how to do this, has heard of a way, or has done something similar, please let me know.
Dumbo Feathers, a set of Java classes meant to be used together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.
Basically, in your Python Dumbo M/R job, you output a key that is a tuple of two elements: the first element is the name of the directory to output to, and the second element is the actual key. The output class you've selected then inspects the tuple to find which output directory to use, and uses MultipleOutputFormat to write to different subdirectories.
With Dumbo, this is easy because typedbytes is used as the output format, but I think it should be doable with other output formats as well.
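To make the tuple-key idea concrete, here is a minimal sketch of what the Python side might look like. The mapper and reducer are plain generator functions written in the style Dumbo uses, but the snippet does not import Dumbo, so it can be run on its own. The routing rule (lines starting with "ERROR" go to out2, everything else to out1) and the run_locally helper are assumptions made up for illustration. On a real cluster, the splitting into subdirectories would still be done by the Java output class selected for the job, not by this Python code.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Hypothetical routing rule: "ERROR" lines go to out2, the rest to out1.
    directory = "out2" if value.startswith("ERROR") else "out1"
    actual_key = value.split()[0]
    # The emitted key is a (directory, actual_key) tuple.
    yield (directory, actual_key), 1


def reducer(key, values):
    # Pass the (directory, actual_key) tuple through unchanged so that the
    # output class can inspect its first element and pick the subdirectory.
    yield key, sum(values)


def run_locally(records):
    """Simulate map -> sort -> reduce in-process (illustration only)."""
    mapped = [pair for k, v in records for pair in mapper(k, v)]
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


if __name__ == "__main__":
    sample = [(0, "ERROR disk full"), (1, "INFO started"),
              (2, "ERROR timeout"), (3, "INFO stopped")]
    for (directory, actual_key), total in run_locally(sample):
        print(directory, actual_key, total)
```

Each reducer output carries its target directory as the first element of the key, which is the only thing the output class needs in order to decide where a record is written.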