Hadoop MapReduce, Java implementation questions

I'm currently getting into Apache Hadoop (with Java implementations of the MapReduce jobs). I looked into some examples (like the WordCount example) and have had success playing around with writing custom MapReduce apps (I'm using the Cloudera Hadoop Demo VM). My question is about some implementation and runtime details.

The prototype of the job class is as follows:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // mapping
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      // reducing
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // setting map and reduce classes, and various configs
    JobClient.runJob(conf);
  }
}

I have some questions. I tried to Google them, but I must say that the Hadoop documentation is very formal (like a big reference book) and not well suited for beginners.

My questions:

  • Do the Map and Reduce classes have to be static inner classes of the main class, or can they live anywhere (as long as they are visible from main)?
  • Can you use anything that Java SE and the available libraries have to offer, like in an ordinary Java SE app? I mean things like JAXB, Guava, Jackson for JSON, etc.
  • What is the best practice for writing generic solutions? I mean: we want to process large amounts of log files in different (but slightly similar) ways. The last token of each log line is always a JSON map with some entries. One processing could be: count and group the log rows by (keyA, keyB from the map), and another could be: count and group the log rows by (keyX, keyY from the map). (I'm thinking of some config-file-based solution where you provide the actually needed entries to the program, so if you need a new aggregation, you just have to provide the config and run the app.)
  • Possibly relevant: in the WordCount example the Map and Reduce classes are static inner classes and main() has zero influence on them; it just provides these classes to the framework. Can you make these classes non-static and give them fields and a constructor to alter the runtime behavior with some current values (like the config parameters I mentioned)?

Maybe I'm digging into the details unnecessarily. The overall question is: is a Hadoop MapReduce program still the normal Java SE app we are used to?

asked May 03 '13 by gyorgyabraham
1 Answer

Here are your answers.

  1. The mapper and reducer classes can be separate Java classes, anywhere in the package structure, or even in separate jar files, as long as the class loader of the MapTask/ReduceTask is able to load them. The example you showed is a quick-testing setup for Hadoop beginners.

  2. Yes, you can use any Java libraries. These third-party jars should be made available to the MapTask/ReduceTask either through the -files option of the hadoop jar command or by using the Hadoop API. See the Hadoop documentation for more information on adding third-party libraries to the Map/Reduce classpath.

  3. Yes, you can configure and pass configuration values to the Map/Reduce jobs using either of these approaches.

    3.1 Use the org.apache.hadoop.conf.Configuration object as below to set the configuration in the client program (the Java class with the main() method):

    Configuration conf = new Configuration();
    conf.set("config1", "value1");
    Job job = new Job(conf, "Whole File input");

The Map/Reduce programs have access to the Configuration object and can read the values set for these properties using the get() method. This approach is advisable if the configuration settings are small.
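
For illustration, here is a minimal sketch of the task side, using the old org.apache.hadoop.mapred API from the question (the class name ConfigurableMapper and the property name config1 are just illustrative): a mapper can pick up such a property by overriding configure(), which MapReduceBase provides as a no-op hook.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper that reads a client-set property in configure()
public class ConfigurableMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private String config1;

  @Override
  public void configure(JobConf job) {
    // called once per task before map(); pick up the value set by the client
    config1 = job.get("config1", "defaultValue");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // use config1 to steer the mapping logic here
  }
}

With the newer org.apache.hadoop.mapreduce API you would instead override setup(Context) and call context.getConfiguration().get(...).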

3.2 Use the DistributedCache to load the configuration data and make it available in the Map/Reduce programs. See the DistributedCache documentation for details. This approach is more advisable when the configuration data is larger.
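
As a rough sketch (the HDFS path, file name, and class name below are made up for illustration), with the old mapred API the client registers the file on the job configuration and each task then locates its local copy:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Illustrative base class showing both sides of the DistributedCache handshake
public class CachedConfigMapperBase extends MapReduceBase {

  protected Path cachedConfig;

  // client side: register the config file before submitting the job
  public static void addConfigFile(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/me/jobconfig/rules.json"), conf);
  }

  // task side: locate the local copy of the cached file
  @Override
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (cached != null && cached.length > 0) {
        cachedConfig = cached[0]; // parse the file here as needed
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not read distributed cache files", e);
    }
  }
}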

4. The main() method is the client program, which is responsible for configuring and submitting the Hadoop job. If a configuration is not set, the default setting is used. The configurations include the Mapper class, Reducer class, input path, output path, input format class, number of reducers, etc. For example:
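
Here is a sketch of such a driver, assuming the old mapred API and the Map/Reduce static inner classes from the question's WordCount skeleton (the driver class name and the use of command-line arguments for the paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // key/value types emitted by the job
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // mapper/reducer classes (the static inner classes from the question)
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // input/output formats and paths
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // optional: number of reducers
    conf.setNumReduceTasks(1);

    JobClient.runJob(conf);
  }
}

You would then submit it with something like hadoop jar wordcount.jar WordCountDriver <input> <output>.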

Additionally, look at the Hadoop documentation on job configuration.

Yes, Map/Reduce programs are still Java SE programs; however, they are distributed across the machines in the Hadoop cluster. Let's say the Hadoop cluster has 100 nodes and you submit the word count example. The Hadoop framework creates a Java process for each of the Map and Reduce tasks and invokes the callback methods such as map()/reduce() on the subset of machines where the data exists. Essentially, your mapper/reducer code gets executed on the machines where the data resides. I would recommend you read Chapter 6 of Hadoop: The Definitive Guide.

I hope this helps.

answered Sep 20 '22 by Niranjan Sarvi