
Hadoop - constructor args for mapper

Tags: java, scala, hadoop

Is there any way to give constructor args to a Mapper in Hadoop? Possibly through some library that wraps the Job creation?

Here's my scenario:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopTest {

    // Extractor turns a line into a "feature"
    public static interface Extractor {
        public String extract(String s);
    }

    // A concrete Extractor, configurable with a constructor parameter
    public static class PrefixExtractor implements Extractor {
        private int endIndex;

        public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }

        public String extract(String s) { return s.substring(0, this.endIndex); }
    }

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Extractor extractor;

        // Constructor configures the extractor
        public Map(Extractor extractor) { this.extractor = extractor; }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String feature = extractor.extract(value.toString());
            context.write(new Text(feature), new Text(value.toString()));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) context.write(key, val);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "test");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

As should be clear, since the Mapper is only given to the Job as a class reference (Map.class), Hadoop instantiates it by reflection and therefore needs a no-argument constructor; there is no way for it to pass a constructor argument and configure a specific Extractor.

There are Hadoop-wrapping frameworks out there like Scoobi, Crunch, and Scrunch (and probably many more I don't know about) that seem to have this capability, but I don't know how they accomplish it. EDIT: After working with Scoobi some more, I discovered I was partially wrong about this: if you use an externally defined object in the "mapper", Scoobi requires that it be serializable, and will complain at runtime if it isn't. So maybe the right way is just to make my Extractor serializable and deserialize it in the Mapper's setup method...
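For concreteness, here is a rough sketch of that idea, assuming Java 8+ (for java.util.Base64) and a PrefixExtractor that implements java.io.Serializable; the helper class, method names, and configuration key are all made up for illustration, not from Hadoop or any framework:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;

import org.apache.hadoop.conf.Configuration;

// Illustrative helper: round-trips a Serializable Extractor through the
// job Configuration as a Base64-encoded string.
public class ExtractorSupport {
    public static final String EXTRACTOR_KEY = "hadooptest.extractor"; // made-up key

    // Driver side, before submitting the job:
    //   ExtractorSupport.store(conf, new PrefixExtractor(5));
    public static void store(Configuration conf, HadoopTest.Extractor extractor) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(extractor); // requires the Extractor to be Serializable
        }
        conf.set(EXTRACTOR_KEY, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Mapper side, from setup(context):
    //   this.extractor = ExtractorSupport.load(context.getConfiguration());
    public static HadoopTest.Extractor load(Configuration conf) throws IOException {
        byte[] raw = Base64.getDecoder().decode(conf.get(EXTRACTOR_KEY));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (HadoopTest.Extractor) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}

Deserializing in setup this way leaves the Mapper with its implicit no-argument constructor, which is what Hadoop's reflection-based instantiation needs.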

Also, I actually work in Scala, so Scala-based solutions are definitely welcome (if not encouraged!)

asked Nov 15 '11 by dhg


1 Answer

I'd suggest telling your mapper which extractor to use via the Configuration object you're creating. The mapper receives that configuration in its setup method through context.getConfiguration(). You can't put arbitrary objects into the configuration, since it is usually built from XML files or the command line, but you can set a string or enum value and have the mapper construct its extractor itself. It's not very pretty to customize the mapper after its creation, but that's how I interpret the API.
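A minimal sketch of that approach, applied to the question's PrefixExtractor (the configuration key "extractor.prefix.length" is made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredMapDemo {

    // Mapper side: rebuild the extractor from the configuration in setup(),
    // so the class keeps the no-argument constructor Hadoop requires.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        private HadoopTest.Extractor extractor;

        @Override
        protected void setup(Context context) {
            int endIndex = context.getConfiguration().getInt("extractor.prefix.length", 1);
            this.extractor = new HadoopTest.PrefixExtractor(endIndex);
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(new Text(extractor.extract(value.toString())), new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Driver side: record the extractor's parameter as a plain config value.
        Configuration conf = new Configuration();
        conf.setInt("extractor.prefix.length", 5); // made-up key name
        Job job = new Job(conf, "test");
        job.setMapperClass(Map.class);
        // ... remaining job setup as in the question ...
        job.waitForCompletion(true);
    }
}

The important change is that Map is now instantiable by reflection, and all per-job state travels through the Configuration instead of constructor arguments.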

answered Sep 29 '22 by wutz