
MultipleTextOutputFormat alternative in new API

As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?

Amar asked Feb 26 '13 22:02


1 Answer

I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:

public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)

or

public <K,V> void write(String namedOutput, K key, V value,
                        String baseOutputPath)

The former write method requires the key to be the same type as the map output key (in case you are using this in the mapper) or the same type as the reduce output key (in case you are using this in the reducer). The value must also be typed in similar fashion.

The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs static properties using the addNamedOutput function:

public static void addNamedOutput(Job job,
                              String namedOutput,
                              Class<? extends OutputFormat> outputFormatClass,
                              Class<?> keyClass,
                              Class<?> valueClass)

So if you need different output types than the Context is using, you must use the latter write method.

The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:

multipleOutputs.write("output1", key, value, "dir1/part");

In my case, this created files named "dir1/part-r-00000".
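The naming pattern reported above can be sketched in plain Java. This is only an illustration of the observed pattern (base path, then a task-type letter, then a five-digit partition number); it does not call Hadoop, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative only: reproduces the naming pattern reported above
// ("dir1/part" -> "dir1/part-r-00000"). It does not call Hadoop.
final class OutputFileNames {

    // taskType is 'm' for a map task or 'r' for a reduce task (assumed
    // convention); partition is the task's partition number.
    static String fileNameFor(String baseOutputPath, char taskType, int partition) {
        return String.format(Locale.ROOT, "%s-%c-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        // Same base path as in the write(...) call above, first reducer.
        System.out.println(fileNameFor("dir1/part", 'r', 0));
    }
}
```

The part before the last directory separator becomes a subdirectory of the job's output directory, and the remainder becomes the file name prefix.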

I was not successful in using a baseOutputPath that contains the .. directory, so in practice every baseOutputPath has to stay inside the path passed to the -output parameter.
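Since the base path has to stay inside the job's output directory, one way to derive it from a key is to sanitize the key first. The helper below is a hypothetical sketch in plain Java, not part of MultipleOutputs; the character whitelist, the "unknown" fallback and the "part" prefix are all assumptions chosen for illustration.

```java
// Hypothetical helper: turns a record key into a relative baseOutputPath
// of the form "<directory>/part". Not part of the Hadoop API.
final class BaseOutputPaths {

    static String forKey(String key) {
        // Replace every character outside [A-Za-z0-9_-] with '_', so the key
        // cannot introduce a directory separator or a ".." component.
        String dir = key.replaceAll("[^A-Za-z0-9_-]", "_");
        if (dir.isEmpty()) {
            dir = "unknown"; // assumed fallback directory for empty keys
        }
        return dir + "/part";
    }

    public static void main(String[] args) {
        System.out.println(forKey("2013-02-26"));
        System.out.println(forKey("../etc"));
    }
}
```

Assuming a Text key, the result could then be passed as the last argument of the write call shown earlier, for example multipleOutputs.write("output1", key, value, BaseOutputPaths.forKey(key.toString())).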

For more details on how to set up and properly use MultipleOutputs, see this code (not mine, but I found it very helpful; it does not use different output directories):
https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java

Eddified answered Nov 14 '22 20:11