
Generating Separate Output files in Hadoop Streaming

Using only a mapper (a Python script) and no reducer, how can I output a separate file with the key as the filename, for each line of output, rather than having long files of output?

asked Oct 26 '09 by Ryan R. Rosario

People also ask

Why does Hadoop create multiple output files?

The MultipleOutputs class lets a Hadoop map/reduce job write output to more than one folder. Basically, we use MultipleOutputs when we want to write outputs beyond the job's default output, or to write job output to different, user-specified files.

Can we write the output of MapReduce in different formats?

The default Hadoop reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, since TextOutputFormat converts them to strings by calling toString() on them.

Is Hadoop capable of having multiple inputs?

If multiple input files are present in the same directory: by default, Hadoop doesn't read the directory recursively. But if multiple input files like data1, data2, etc. are present in /folder1, set mapreduce.input.fileinputformat.input.dir.recursive to true so they are all picked up.

Which is the tool of Hadoop streaming data transfer?

Apache Flume is a tool for streaming data transfer into Hadoop.


3 Answers

The input and output format classes can be replaced via the -inputformat and -outputformat command-line parameters.

One example of how to do this can be found in the dumbo project, which is a python framework for writing streaming jobs. It has a feature for writing to multiple files, and internally it replaces the output format with a class from its sister project, feathers - fm.last.feathers.output.MultipleTextFiles.
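As a rough illustration, a streaming job invocation with the feathers output format swapped in might be assembled like this. The jar locations and HDFS paths below are hypothetical placeholders; only the class name fm.last.feathers.output.MultipleTextFiles comes from the answer above.

```python
# Sketch of a hadoop streaming invocation using a custom output format.
# All file and HDFS paths here are made-up placeholders.
streaming_cmd = [
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",
    "-libjars", "/path/to/feathers.jar",  # jar providing the custom output format
    "-outputformat", "fm.last.feathers.output.MultipleTextFiles",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-input", "/user/me/input",
    "-output", "/user/me/output",
]
print(" ".join(streaming_cmd))
```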

The reducer then needs to emit a tuple as key, with the first component of the tuple being the path to the directory where the files with the key/value pairs should be written. There might still be multiple files, that depends on the number of reducers and the application.
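A minimal sketch of such a reducer, assuming a dumbo-style generator interface where each yielded key is a tuple whose first component is the target directory (the key-to-directory mapping below is a made-up example):

```python
# Hedged sketch of a reducer whose output key is a tuple; the first
# component names the directory the key/value pairs should land in.
def reducer(key, values):
    out_dir = key.lower()  # hypothetical mapping: key "ERROR" -> directory "error"
    for value in values:
        # an output format like feathers' MultipleTextFiles interprets the
        # first tuple component as the output path for this pair
        yield (out_dir, key), value

# feeding it one key group by hand:
pairs = list(reducer("ERROR", ["disk full", "timeout"]))
```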

I recommend looking into dumbo; it has many features that make it easier to write Map/Reduce programs for Hadoop in Python.

answered Nov 03 '22 by Erik Forsberg


You can either write to a text file on the local filesystem using Python's file functions, or, if you want to write to HDFS, use the Thrift API.
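The local-filesystem route can be sketched as a map-only script that opens one file per distinct key instead of emitting to stdout. The tab-separated input format and the output directory are assumptions for illustration:

```python
import os
import tempfile

# Sketch: write one local file per key instead of emitting to stdout.
# Assumes each input line is "key<TAB>value".
def write_per_key(lines, out_dir):
    handles = {}
    try:
        for line in lines:
            key, _, value = line.rstrip("\n").partition("\t")
            if key not in handles:  # open one file per distinct key
                handles[key] = open(os.path.join(out_dir, key + ".txt"), "a")
            handles[key].write(value + "\n")
    finally:
        for f in handles.values():
            f.close()

out_dir = tempfile.mkdtemp()
write_per_key(["apple\t1", "banana\t2", "apple\t3"], out_dir)
```

Note that this only works cleanly on a single node; on a real cluster each task writes to its own local disk, which is one reason the custom-output-format approach above is usually preferred.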

answered Nov 03 '22 by Mihai A


Is it possible to replace the output format class when using streaming? In a native Java implementation you would extend the MultipleTextOutputFormat class and override the method that names the output file, then register your implementation as the new output format with JobConf's setOutputFormat method.

You should verify whether this is possible in streaming too; I don't know. :-/

answered Nov 03 '22 by Peter Wippermann