I'm attempting to do something simple in Hadoop, and I've noticed that mappers and reducers are defined as static inner classes everywhere. My task is going to be decomposed into several map phases and one final reduce. What if I'd like to reuse one of my mappers in another job? If my mapper class is defined as a static inner class, can I use it in another job? Also, non-trivial problems may require many more (and more complicated) mappers, so putting them all into one giant file becomes painful to maintain. Is there any way to have mappers and reducers as regular classes (possibly even in a separate JAR), separate from the job itself?
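To make the setup concrete, here is roughly the pattern I mean (a rough sketch; class names are placeholders) — the mapper nested inside the driver class as a static class, which is what I'd like to move away from so other jobs can reuse it:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyJob {

        // Mapper nested inside the driver as a static class, as in most examples.
        public static class MyMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(MyJob.class);
            job.setMapperClass(MyMapper.class);
            // ... reducer, input/output formats, paths, etc.
        }
    }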
When mapper and reducer classes are declared as inner classes of another class, they have to be declared static so that they do not depend on an instance of the enclosing class.
As for making things static in general: in Java it's good practice to mark a pure function as static even if it is private. That makes it explicit that the method is independent of any instance, even though it doesn't guarantee that the method is actually pure.
A Hadoop Java program consists of a Mapper class and a Reducer class along with a driver class. The Hadoop Mapper is the task that processes every input record from a file and generates output that in turn serves as the input to the Reducer; it produces that output by emitting new key-value pairs.
In the classic word-count example, the four type parameters of the mapper are LongWritable, Text, Text and IntWritable: the first two describe the input key/value pair and the last two describe the intermediate output pair. The four type parameters of the reducer are then Text, IntWritable, Text and IntWritable.
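For illustration, a word-count-style sketch (class names invented here) of how those type parameters line up — the mapper's output pair types are exactly the reducer's input pair types:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input: byte offset of the line (LongWritable) + the line itself (Text).
    // Intermediate output: word (Text) + count of one (IntWritable).
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

    // Input: word (Text) + all of its counts (IntWritable).
    // Final output: word (Text) + summed count (IntWritable).
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }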
Is your question whether the class has to be static, may be static, may be inner, or should be inner?
Hadoop itself needs to be able to instantiate your Mapper or Reducer by reflection, given the class reference/name configured in your Job. This will fail if it is a non-static inner class, since an instance can then be created only in the context of an instance of some other of your classes, which Hadoop presumably knows nothing about. (Unless the inner class extends its enclosing class, I suppose.)
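A minimal sketch of that failure mode, using plain Java reflection (Hadoop goes through its own ReflectionUtils helper, but the effect is the same; the class names here are made up):

    public class Outer {

        // Static nested class: gets an ordinary no-arg constructor.
        public static class StaticMapper { }

        // Non-static inner class: its only constructor implicitly takes an Outer instance.
        public class InnerMapper { }

        public static void main(String[] args) throws Exception {
            // Roughly what a framework does when all it has is a class name.
            Object ok = StaticMapper.class.getDeclaredConstructor().newInstance();   // works

            // Throws NoSuchMethodException: there is no no-arg constructor,
            // only InnerMapper(Outer), and the framework has no Outer instance.
            Object fails = InnerMapper.class.getDeclaredConstructor().newInstance();
        }
    }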
So to answer the first question: it should not be a non-static inner class, since that almost surely makes it unusable. To answer the second and third: yes, it can be a static (inner) class.
To me a Mapper or Reducer is plainly a top-level concept and deserves a top-level class. Some like to make them static inner classes to pair them with a "Runner" class. I don't like this, as that is really what subpackages are for. You note another design reason to avoid it. To the fourth question: no, I don't believe inner classes are good practice here.
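For example (package and class names are only illustrative), each mapper can live in its own file under a subpackage, and any job can wire it in with job.setMapperClass(TokenCountMapper.class):

    // File: com/example/analytics/mapreduce/TokenCountMapper.java
    package com.example.analytics.mapreduce;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A top-level mapper, reusable by any job that needs word tokenization.
    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }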
Final question: yes, the Mapper and Reducer classes can be in a separate JAR file. You tell Hadoop which JAR file contains all of this code, and that's the one it will ship off to the workers. The workers don't need your Job class; however, they do need everything the Mapper and Reducer depend on, in that same JAR.
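A sketch of the driver side under those assumptions (class and path names are invented; TokenCountMapper and TokenSumReducer stand in for your own classes living in another JAR):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.example.analytics.mapreduce.TokenCountMapper;
    import com.example.analytics.mapreduce.TokenSumReducer;

    public class TokenCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "token count");

            // Point Hadoop at the JAR that actually contains the mapper/reducer
            // (and their dependencies); that JAR is what gets shipped to the workers.
            job.setJarByClass(TokenCountMapper.class);

            job.setMapperClass(TokenCountMapper.class);
            job.setReducerClass(TokenSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If the mapper/reducer code lives in yet another JAR, the standard -libjars generic option (available when the driver runs through ToolRunner/GenericOptionsParser) is one way to ship it alongside the job.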
I feel the answer above is precise and covers the rationale well. Except that, in my opinion, inner classes should still be used when creating the map and reduce steps; IMO, all the code should be in one place. And generics can be used thoughtfully in that single class to ensure there are no typecasting errors.
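As a rough sketch of that style (names invented), the reducer sits next to the driver as a static nested class with its generics spelled out, so the key/value types are checked at compile time instead of being cast inside reduce():

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountJob {

        // Generics on Reducer mean the reduce() signature and context.write()
        // are type-checked by the compiler; no manual casting of keys or values.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                total.set(sum);
                context.write(key, total);
            }
        }
    }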