How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The file names contain a small but crucial piece of information that is not present in the files themselves. Is there a way to find out, inside a map function, the name of the input file that a given key-value pair comes from?

I'm looking for an equivalent of this Java code:

FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();

Thanks in advance!

asked Jul 11 '12 14:07

Bolo

2 Answers

The map.input.file property gives the name of the input file (in Hadoop 2.x it is called mapreduce.map.input.file).

According to Hadoop: The Definitive Guide:

The properties can be accessed from the job’s configuration, obtained in the old MapReduce API by providing an implementation of the configure() method for Mapper or Reducer, where the configuration is passed in as an argument. In the new API, these properties can be accessed from the context object passed to all methods of the Mapper or Reducer.
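For a Python job the route is different: mrjob runs on Hadoop Streaming, which passes job configuration properties to the script as environment variables, with the dots in the property name turned into underscores. A minimal sketch of that lookup follows; the helper names jobconf_env_name and jobconf_from_environ are made up for illustration and are not part of mrjob or Hadoop.

```python
import os


def jobconf_env_name(prop):
    """Return the environment variable name Hadoop Streaming uses for a
    jobconf property: dots in the property name become underscores."""
    return prop.replace('.', '_')


def jobconf_from_environ(prop, environ=None, default=None):
    """Look up a jobconf property in the task's environment.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try outside a cluster.
    """
    environ = os.environ if environ is None else environ
    return environ.get(jobconf_env_name(prop), default)
```

Inside a running map task, jobconf_from_environ('map.input.file') would then return the full path of the file being read, or None if the variable is not set.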

answered Sep 17 '22 20:09

Praveen Sripati

If you are using Hadoop 2.x with Python:

import os

file_name = os.environ['mapreduce_map_input_file']
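To get closer to the Java snippet in the question (which returns just the file name, not the whole path), something like the sketch below might work. It tries the Hadoop 2.x variable first and falls back to the older map_input_file name. The function names are illustrative, and mapper here is a plain generator with the same shape as an MRJob mapper method, so the code runs without mrjob installed.

```python
import os
import posixpath

# Hadoop 2.x variable first, then the older Hadoop 1.x one.
_INPUT_FILE_VARS = ('mapreduce_map_input_file', 'map_input_file')


def input_file_name(environ=None):
    """Return the base name of the current input file, or None if neither
    environment variable is set (e.g. when running outside Hadoop)."""
    environ = os.environ if environ is None else environ
    for var in _INPUT_FILE_VARS:
        path = environ.get(var)
        if path:
            # HDFS paths use forward slashes, so posixpath is enough here.
            return posixpath.basename(path)
    return None


def mapper(_, line, environ=None):
    """Emit each input line keyed by the name of the file it came from."""
    yield input_file_name(environ), line
```

In a real job the body of mapper would go into the mapper method of your MRJob subclass, without the environ argument. mrjob also has a helper, mrjob.compat.jobconf_from_env, which as far as I know translates between the old and new property names for you, so it may be worth checking before rolling your own.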

answered Sep 18 '22 20:09

Boggio

Tags: python, hadoop, hadoop-streaming, mrjob
