 

How to get the current filename in Hadoop Reduce

Tags:

java

hadoop

I am using the WordCount example and in the Reduce function, I need to get the file name.

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    String filename = ((FileSplit)(.getContext()).getInputSplit()).getPath().getName();
    // ----------------------------^ I need to get the context and filename!
    key.set(key.toString() + " (" + filename + ")");
    output.collect(key, new IntWritable(sum));
  }
}

This is my current modified code, where I want the filename to be printed alongside each word. I tried following "Java Hadoop: How can I create mappers that take as input files and give an output which is the number of lines in each file?" but I couldn't get the context object.

I am new to Hadoop and would appreciate any help.

Asked Dec 17 '13 by Praveen Kumar Purushothaman

3 Answers

You can't get context, because context is a construct of the "new API", and you are using the "old API".

Check out this word count example instead: http://wiki.apache.org/hadoop/WordCount

See the signature of the reduce function in this case:

public void reduce(Text key, Iterable<IntWritable> values, Context context) 

See! The context! Notice that this example imports from org.apache.hadoop.mapreduce instead of org.apache.hadoop.mapred.

This is a common issue for new Hadoop users, so don't feel bad. In general you want to stick to the new API for a number of reasons. But be very careful of the examples you find, and realize that the new API and the old API are not interoperable (e.g., you can't mix a new-API mapper with an old-API reducer).
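Worth noting: even in the new API, only the mapper's Context exposes the input split (reduce input is merged from many mappers), so the usual pattern is to attach the filename on the map side. A minimal sketch against the new API (the class name FileTaggingMapper is illustrative, not from the original WordCount):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: a new-API mapper that tags each word with its source file,
// so the filename travels to the reducer inside the key itself.
public class FileTaggingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    private String filename;

    @Override
    protected void setup(Context context) {
        // The mapper's Context exposes the input split; a reducer's does not.
        filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            // Emit "word (filename)" so the reducer's sums are per word per file.
            word.set(tokenizer.nextToken() + " (" + filename + ")");
            context.write(word, one);
        }
    }
}
```

The reducer can then stay a plain summing reducer, since the filename is already part of the key.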

Answered by Donald Miner

Using the old MR API (org.apache.hadoop.mapred package), add the below to the mapper class. ("map.input.file" is a per-map-task property, so it is not available on the reduce side, where input is merged from many splits.)

String fileName;

public void configure(JobConf job)
{
    fileName = job.get("map.input.file");
}

Using the new MR API (org.apache.hadoop.mapreduce package), add the below to the mapper class (the mapper's Context exposes the input split; a reducer's Context does not).

String fileName;

protected void setup(Context context) throws IOException, InterruptedException
{
    fileName = ((FileSplit) context.getInputSplit()).getPath().toString();
}
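Putting the old-API version together, a mapper that uses the fileName captured in configure() might look like this (a sketch; the class name TaggingMapper is illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: old-API mapper that emits "word (filename)" keys, using the
// fileName field populated once per task in configure().
public class TaggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    private String fileName;

    @Override
    public void configure(JobConf job) {
        // Set per map task by the framework; not available in the reducer.
        fileName = job.get("map.input.file");
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken() + " (" + fileName + ")");
            output.collect(word, ONE);
        }
    }
}
```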
Answered by Praveen Sripati


I used this way and it works!!!

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // In the old API the input split is available from the Reporter;
    // fetch it once per call rather than once per token.
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // Tag each word with its source file so the filename reaches the output.
      word.set(tokenizer.nextToken() + " (" + filename + ")");
      output.collect(word, one);
    }
  }
}

Let me know if I can improve it!

Answered by Praveen Kumar Purushothaman