
How to get the name of input file in MRjob

I'm writing a map function using mrjob. My input will come from files in a directory on HDFS. The file names contain a small but crucial piece of information that is not present in the files themselves. Is there a way to find out, inside a map function, the name of the input file that a given key-value pair comes from?

I'm looking for an equivalent of this Java code:

FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();

Thanks in advance!

Bolo asked Jul 11 '12 14:07



2 Answers

The map.input.file property gives the name of the input file.

According to Hadoop: The Definitive Guide:

The properties can be accessed from the job’s configuration, obtained in the old MapReduce API by providing an implementation of the configure() method for Mapper or Reducer, where the configuration is passed in as an argument. In the new API, these properties can be accessed from the context object passed to all methods of the Mapper or Reducer.
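Since mrjob runs on top of Hadoop Streaming, a Python mapper has no configure() method or context object to query. Streaming instead exports job configuration properties to the task as environment variables, with the dots in the property name turned into underscores. A minimal sketch of that lookup follows; the helper names `jobconf_env_name` and `get_jobconf` are made up for illustration and are not part of Hadoop or mrjob.

```python
import os


def jobconf_env_name(prop):
    """Return the environment variable name Hadoop Streaming uses for a
    job configuration property (dots become underscores)."""
    return prop.replace(".", "_")


def get_jobconf(prop, default=None, environ=None):
    """Look up a job configuration property from the task environment.

    `environ` is a hypothetical hook so the function can be tried out
    locally; inside a real task it defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return environ.get(jobconf_env_name(prop), default)
```

Inside a mapper, `get_jobconf("map.input.file")` would then return the full path of the file the current split belongs to, or `None` when the variable is not set (for example when the script is run outside Hadoop).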

Praveen Sripati answered Sep 17 '22 20:09


If you are using Hadoop 2.x with Python, the property is exposed to the task as an environment variable:

import os

file_name = os.environ['mapreduce_map_input_file']
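The lookup above returns the full path and raises KeyError if the variable is missing. Below is a rough sketch of a more forgiving helper that checks both the Hadoop 2.x name and the older `map_input_file` name, and returns only the last path component, which is what the Java `getPath().getName()` call in the question yields. The function name `input_file_name` and its `environ` parameter are assumptions made for this example.

```python
import os
import posixpath


def input_file_name(environ=None):
    """Return the base name of the current input file, or None if unknown.

    Checks the Hadoop 2.x variable first, then the older name.
    `environ` is a hypothetical hook for local testing; it defaults
    to os.environ inside a real task.
    """
    environ = os.environ if environ is None else environ
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file"))
    if path is None:
        return None
    # HDFS paths use forward slashes, so posixpath is safe on any platform.
    return posixpath.basename(path)
```

In an mrjob job this could be called at the top of the mapper, for example to emit `(input_file_name(), value)` pairs. mrjob also ships `mrjob.compat.jobconf_from_env`, which, as far as I know, handles the old and new property names for you; check the documentation of your mrjob version before relying on it.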
Boggio answered Sep 18 '22 20:09