
Hadoop: Easy way to have object as output value without Writable interface

I am trying to use Hadoop to train multiple models. My data is small enough to fit in memory, so I want to train one model in every map task.

My problem is that once I have finished training a model, I need to send it to the reducer. I am using Weka to train the model. I don't want to work out how to implement the Writable interface for the Weka classes, because that would take a lot of effort. I am looking for a simple way to do this.

The Classifier class in Weka implements the Serializable interface. How can I send this object to the reducer?

Edit:

Here is the link that mentions Weka object serialization: http://weka.wikispaces.com/Serialization

Here is what my code looks like. Configuring the job (only part of the configuration is shown):

       conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization," + "org.apache.hadoop.io.serializer.WritableSerialization"); 
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Classifier.class);

Map function:

     // load the dataset into the "data" variable (an Instances object)
     Classifier tree = new J48();
     tree.buildClassifier(data);
     context.write(new Text("whatever"), tree);

My Map class extends Mapper<Object, Text, Text, Classifier>.

But I am getting this error:

     java.lang.NullPointerException
         at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:416)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
         at org.apache.hadoop.mapred.Child.main(Child.java:253)

What am I doing wrong?

jojoba asked Mar 28 '12 18:03


1 Answer

You can define your own serialization mechanism:

  • http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html
  • https://issues.apache.org/jira/browse/HADOOP-1986

I think it revolves around implementing the Serialization interface and registering your implementation in the io.serializations configuration property.

In your case, if you just want to use java serialization, set this property to:

  • org.apache.hadoop.io.serializer.JavaSerialization
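If the io.serializations route proves awkward to get working, one alternative (not part of the answer above, only a sketch) is to run Java serialization yourself and ship the resulting bytes as the value. The example below uses only the JDK. `ToyModel` and `ModelBytes` are hypothetical names invented for illustration; a Weka Classifier would take the place of `ToyModel`, since the question states that it implements Serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a trained model; any Serializable object works the same way.
class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final double[] weights;

    ToyModel(String name, double[] weights) {
        this.name = name;
        this.weights = weights;
    }
}

// Hypothetical helper: converts Serializable objects to and from byte arrays.
class ModelBytes {
    // Serialize the object with standard Java serialization.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Deserialize from the first `length` bytes only, so a padded buffer is safe to pass in.
    static Object fromBytes(byte[] bytes, int length)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes, 0, length))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel trained = new ToyModel("tree-1", new double[] {0.5, 1.5});
        byte[] payload = toBytes(trained);                  // what the mapper would emit
        ToyModel restored = (ToyModel) fromBytes(payload, payload.length); // reducer side
        System.out.println(restored.name + " " + restored.weights.length);
    }
}
```

In a mapper, the byte array could be wrapped in a BytesWritable (which already implements Writable), with BytesWritable.class set as the output value class. On the reducer side, pass getLength() along with getBytes(), because the backing array can be longer than the valid data.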
Chris White answered Nov 07 '22 04:11