Is there any way to give constructor args to a Mapper in Hadoop? Possibly through some library that wraps the Job creation?
Here's my scenario:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopTest {

    // Extractor turns a line into a "feature"
    public static interface Extractor {
        public String extract(String s);
    }

    // A concrete Extractor, configurable with a constructor parameter
    public static class PrefixExtractor implements Extractor {
        private int endIndex;

        public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }

        public String extract(String s) { return s.substring(0, this.endIndex); }
    }

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Extractor extractor;

        // Constructor configures the extractor
        public Map(Extractor extractor) { this.extractor = extractor; }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String feature = extractor.extract(value.toString());
            context.write(new Text(feature), new Text(value.toString()));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) context.write(key, val);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "test");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
As should be clear, the Mapper is handed to the Job only as a class reference (Map.class), so Hadoop has no way to pass a constructor argument and configure a specific Extractor: the framework instantiates the Mapper reflectively on each task, which requires a no-arg constructor.
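Concretely, each map task builds its Mapper via org.apache.hadoop.util.ReflectionUtils, roughly like the sketch below (my paraphrase of the framework internals, not code you would write yourself). With Map's only constructor taking an Extractor, this fails at task startup:

    // What the framework effectively does inside each map task: reflective,
    // no-arg instantiation. Since Map's only constructor takes an Extractor,
    // this throws a RuntimeException (there is no no-arg constructor to call).
    Mapper<Object, Text, Text, Text> mapper =
            ReflectionUtils.newInstance(HadoopTest.Map.class, new Configuration());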
There are Hadoop-wrapping frameworks out there like Scoobi, Crunch, and Scrunch (and probably many more I don't know about) that seem to have this capability, but I don't know how they accomplish it.

EDIT: After working with Scoobi some more, I discovered I was partially wrong about this. If you use an externally defined object in the "mapper", Scoobi requires it to be serializable and will complain at runtime if it isn't. So maybe the right way is just to make my Extractor serializable and deserialize it in the Mapper's setup method...
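For concreteness, here's a rough sketch of that serialization idea, assuming the concrete Extractor also implements java.io.Serializable and that these helpers live next to the Extractor interface above. The helper names and the extractor.bytes config key are my own invention, not anything Hadoop provides:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.Base64; // Java 8+; on older JDKs, Hadoop's bundled commons-codec would do

    import org.apache.hadoop.conf.Configuration;

    public class ExtractorSupport {

        // Driver side: serialize the configured Extractor into the job Configuration.
        public static void storeExtractor(Configuration conf, Extractor extractor) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ObjectOutputStream(bytes).writeObject(extractor); // requires a Serializable Extractor
            conf.set("extractor.bytes", Base64.getEncoder().encodeToString(bytes.toByteArray()));
        }

        // Mapper side: call this from setup(Context) to rebuild the Extractor.
        public static Extractor loadExtractor(Configuration conf) throws IOException, ClassNotFoundException {
            byte[] raw = Base64.getDecoder().decode(conf.get("extractor.bytes"));
            return (Extractor) new ObjectInputStream(new ByteArrayInputStream(raw)).readObject();
        }
    }

The driver would call storeExtractor(conf, new PrefixExtractor(5)) before constructing the Job, and the Mapper's setup would do extractor = loadExtractor(context.getConfiguration()).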
Also, I actually work in Scala, so Scala-based solutions are definitely welcome (if not encouraged!)
I'd suggest telling your mapper which extractor to use via the Configuration object you're creating. The mapper receives that configuration in its setup method via context.getConfiguration(). You can't put arbitrary objects into the configuration, since it is normally built from XML files or the command line, but you could set an enum (or string) value there and have the mapper construct its own extractor from it. It's not very pretty to customize the mapper after its creation, but that's how I interpreted the API.
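For example, here's a minimal sketch of that idea, with made-up config keys feature.extractor and prefix.length:

    // Driver side: record the extractor choice and its parameter in the Configuration.
    Configuration conf = new Configuration();
    conf.set("feature.extractor", "prefix"); // which Extractor to build
    conf.setInt("prefix.length", 5);         // its "constructor argument"
    Job job = new Job(conf, "test");
    // ... rest of the job setup as in the question ...

    public static class Map extends Mapper<Object, Text, Text, Text> {
        private Extractor extractor;

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            // Rebuild the extractor from plain config values instead of a constructor argument.
            if ("prefix".equals(conf.get("feature.extractor"))) {
                extractor = new PrefixExtractor(conf.getInt("prefix.length", 1));
            }
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(new Text(extractor.extract(value.toString())), new Text(value.toString()));
        }
    }

Note that Map keeps its implicit no-arg constructor this way, so Hadoop can instantiate it reflectively.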