 

Problem with -libjars in Hadoop

I am trying to run a MapReduce job on Hadoop, but I am facing an error and I am not sure what is going wrong. I have to pass library JARs which are required by my mapper.

I am executing the following in the terminal:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar /home/hadoop/vardtst.jar -libjars /home/hadoop/clui.jar -libjars /home/hadoop/model.jar gutenberg ou101

and I am getting the following Exception:

at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

Please help. Thanks.

asked Jul 31 '11 by Shrish Bajpai


2 Answers

A subtle but important point is also worth noting: the way to specify additional JARs for the JVMs running the distributed map and reduce tasks is very different from the way to specify them for the JVM running the job client.

  • -libjars makes the JARs available only to the JVMs running the remote map and reduce tasks

  • To make the same JARs available to the client JVM (the JVM created when you run the hadoop jar command), you need to set the HADOOP_CLASSPATH environment variable:

$ export LIBJARS=/path/jar1,/path/jar2            # comma-separated for -libjars
$ export HADOOP_CLASSPATH=/path/jar1:/path/jar2   # colon-separated for the client JVM
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value

See: http://grepalex.com/2013/02/25/hadoop-libjars/

Another cause of incorrect -libjars behaviour can be a wrong implementation or initialization of your custom job class:

  • The job class must implement the Tool interface
  • The Configuration instance must be obtained by calling getConf() instead of creating a new one (a minimal sketch follows the link below)

See: http://kickstarthadoop.blogspot.ca/2012/05/libjars-not-working-in-custom-mapreduce.html
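
A minimal sketch of a driver that satisfies both points (MyJob is a hypothetical class name and the job wiring is elided, so treat this as an illustration rather than a drop-in class):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool { // point 1: implements Tool

      public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // point 2: getConf(), not "new Configuration()"
        Job job = new Job(conf, "my job");
        job.setJarByClass(MyJob.class);
        // ... set mapper, reducer, input/output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
      }
    }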

answered Sep 26 '22 by Vladimir Kroz


When you specify -libjars with the hadoop jar command, first make sure that you edit your driver class as shown below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class myDriverClass extends Configured implements Tool {

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
      }

      public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");

        // ... set mapper, reducer, input/output formats and paths here ...

        return job.waitForCompletion(true) ? 0 : 1;
      }
    }

Now edit your "hadoop jar" command as shown below:

hadoop jar YourApplication.jar [myDriverClass] -libjars path/to/jar/file args

(Note that generic options such as -libjars must appear before your application-specific arguments, because GenericOptionsParser stops parsing at the first argument it does not recognize.)

Now let's understand what happens underneath. Basically, we handle the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes that implement the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modify the Configuration of the Tool.

Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args). This runs the given Tool via Tool.run(String[]) after parsing the generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of the conf.
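
As a rough illustration, ToolRunner.run behaves approximately like the simplified sketch below (paraphrased from the behaviour described above; not the verbatim Hadoop source):

    public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
      if (conf == null) {
        conf = new Configuration(); // builds one if it's null
      }
      // GenericOptionsParser consumes -libjars, -D, -files, etc. and updates conf
      GenericOptionsParser parser = new GenericOptionsParser(conf, args);
      tool.setConf(conf); // hand the possibly modified conf to the Tool
      String[] toolArgs = parser.getRemainingArgs(); // whatever is left over for Tool.run
      return tool.run(toolArgs);
    }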

Now, within the run method, calling getConf() gives us the modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.

Configuration conf = getConf();
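
For contrast, the broken and the working pattern side by side:

    // Configuration conf = new Configuration(); // broken: ignores the conf that ToolRunner prepared
    Configuration conf = getConf();               // works: picks up -libjars and the other generic options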

answered Sep 24 '22 by Isaiah4110