
Generating Separate Output files in Hadoop Streaming

Using only a mapper (a Python script) and no reducer, how can I output a separate file with the key as the filename, for each line of output, rather than having long files of output?

asked Oct 26 '09 by Ryan R. Rosario

People also ask

Why does Hadoop create multiple output files?

The MultipleOutputs class lets a Hadoop map/reduce job write output to more than one folder. Basically, we use MultipleOutputs when we want to write outputs beyond the job's default output, or to write job output to different, user-specified files.

Can we write the output of MapReduce in different formats?

The default Hadoop reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since TextOutputFormat converts them to strings by calling toString() on them.

Is Hadoop capable of having multiple inputs?

If multiple input files are present in the same directory: by default, Hadoop doesn't read the directory recursively. But if multiple input files like data1, data2, etc. are present in /folder1, set mapreduce.input.fileinputformat.input.dir.recursive to true so they are all picked up.

Which is the tool of Hadoop streaming data transfer?

Apache Flume is a tool for streaming data transfer into Hadoop.


3 Answers

The input and output format classes can be replaced via the -inputformat and -outputformat command-line parameters.

One example of how to do this can be found in the dumbo project, which is a python framework for writing streaming jobs. It has a feature for writing to multiple files, and internally it replaces the output format with a class from its sister project, feathers - fm.last.feathers.output.MultipleTextFiles.
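As a rough illustration, a streaming job invocation with the feathers output format swapped in might be assembled like this. The jar locations and HDFS paths below are hypothetical placeholders; only the class name fm.last.feathers.output.MultipleTextFiles comes from the answer above.

```python
# Sketch of a hadoop streaming invocation using a custom output format.
# All file and HDFS paths here are made-up placeholders.
streaming_cmd = [
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",
    "-libjars", "/path/to/feathers.jar",  # jar providing the custom output format
    "-outputformat", "fm.last.feathers.output.MultipleTextFiles",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-input", "/user/me/input",
    "-output", "/user/me/output",
]
print(" ".join(streaming_cmd))
```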

The reducer then needs to emit a tuple as key, with the first component of the tuple being the path to the directory where the files with the key/value pairs should be written. There might still be multiple files, that depends on the number of reducers and the application.
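A minimal sketch of such a reducer, assuming a dumbo-style generator interface where each yielded key is a tuple whose first component is the target directory (the key-to-directory mapping below is a made-up example):

```python
# Hedged sketch of a reducer whose output key is a tuple; the first
# component names the directory the key/value pairs should land in.
def reducer(key, values):
    out_dir = key.lower()  # hypothetical mapping: key "ERROR" -> directory "error"
    for value in values:
        # an output format like feathers' MultipleTextFiles interprets the
        # first tuple component as the output path for this pair
        yield (out_dir, key), value

# feeding it one key group by hand:
pairs = list(reducer("ERROR", ["disk full", "timeout"]))
```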

I recommend looking into dumbo; it has many features that make it easier to write Map/Reduce programs for Hadoop in Python.

answered Nov 03 '22 by Erik Forsberg


You can either write to a text file on the local filesystem using Python's file functions, or, if you want to write to HDFS, use the Thrift API.
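The local-filesystem route can be sketched as a map-only script that opens one file per distinct key instead of emitting to stdout. The tab-separated input format and the output directory are assumptions for illustration:

```python
import os
import tempfile

# Sketch: write one local file per key instead of emitting to stdout.
# Assumes each input line is "key<TAB>value".
def write_per_key(lines, out_dir):
    handles = {}
    try:
        for line in lines:
            key, _, value = line.rstrip("\n").partition("\t")
            if key not in handles:  # open one file per distinct key
                handles[key] = open(os.path.join(out_dir, key + ".txt"), "a")
            handles[key].write(value + "\n")
    finally:
        for f in handles.values():
            f.close()

out_dir = tempfile.mkdtemp()
write_per_key(["apple\t1", "banana\t2", "apple\t3"], out_dir)
```

Note that this only works cleanly on a single node; on a real cluster each task writes to its own local disk, which is one reason the custom-output-format approach above is usually preferred.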

answered Nov 03 '22 by Mihai A


Is it possible to replace the output format class when using streaming? In a native Java implementation you would extend the MultipleTextOutputFormat class and override the method that names the output file, then register your implementation as the new output format with JobConf's setOutputFormat method.

You should verify whether this is possible in streaming too; I don't know. :-/

answered Nov 03 '22 by Peter Wippermann