Spark ClassNotFoundException for a dependency

I added a third-party jar to a Spark project. IntelliJ compiles and runs the code cleanly, but when I submit the job to the cluster with

./bin/spark-submit --master yarn --class myClass my.jar input output_files 

I get

java.lang.NoClassDefFoundError: gov/nih/nlm/nls/metamap/MetaMapApi
    at metamap.MetaProcess$2.call(MetaProcess.java:46)
    at metamap.MetaProcess$2.call(MetaProcess.java:28)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:700)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:700)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: gov.nih.nlm.nls.metamap.MetaMapApi
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

I tried adding the third-party jar via setJars when creating the SparkContext, but it didn't work. Then I added it as a Maven dependency, which didn't help either. Finally, I tried the --jars command-line option, also with no success. Can someone help?


1 Answer

Here are the available options, along with the corresponding commands:

  1. Create a single fat (uber) jar that contains all dependencies and use it with the spark-submit command, as shown below (a build-tool sketch follows this list):

    ./bin/spark-submit --class <MAIN-CLASS> --master yarn --deploy-mode cluster <PATH TO APP JAR FILE>
    
  2. Copy the jar file to an http://, ftp://, or hdfs:// location, then use SparkConf.setJars with the full path, for example SparkConf.setJars(Array("http://mydir/one.jar")) (see the Java sketch after this list), and finally use the spark-submit command in the same fashion, with no changes:

    ./bin/spark-submit --class <MAIN-CLASS> --master yarn --deploy-mode cluster <PATH TO APP JAR FILE>
    
  3. Copy the jar file to an http://, ftp://, or hdfs:// location and then pass it to the spark-submit command with the --jars option:

    ./bin/spark-submit --class <MAIN-CLASS> --jars <http://mydir/one.jar,http://mydir/two.jar> --master yarn --deploy-mode cluster <PATH TO APP JAR FILE>
    
  4. Set spark.driver.extraClassPath and spark.executor.extraClassPath in the SPARK_HOME/conf/spark-defaults.conf file to the full paths of the dependent jar files (an example snippet appears further below), and then use the same spark-submit command as in #1.
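
Since the question mentions Maven, one common way to build the fat jar from option 1 is the maven-shade-plugin. A minimal pom.xml excerpt might look like the following (the plugin version is only illustrative):

    <!-- pom.xml excerpt: bundle all dependencies into one fat jar at package time -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>

Running mvn package then produces a jar under target/ that already contains the MetaMap classes and can be passed to spark-submit directly.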

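For option 2, a minimal Java sketch of the driver-side code might look like this; the class name and the hdfs:// path are assumptions for illustration, not taken from the question:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MetaProcessDriver {
        public static void main(String[] args) {
            // Ship the dependency jar (already uploaded to HDFS) to the executors.
            SparkConf conf = new SparkConf()
                    .setAppName("MetaProcess")
                    .setJars(new String[] { "hdfs:///libs/metamapapi.jar" });
            JavaSparkContext sc = new JavaSparkContext(conf);

            // ... build the RDDs and run the job as before ...

            sc.stop();
        }
    }
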
Depending on the use case, any one of these options should work. If none does, you may need to recheck the dependencies you are providing to the spark-submit command.
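
If you go with option 4, the entries in SPARK_HOME/conf/spark-defaults.conf might look like the snippet below; the jar path is an assumption, and with extraClassPath the jar must already exist at that path on the driver and on every executor node:

    # SPARK_HOME/conf/spark-defaults.conf  (jar path is illustrative)
    spark.driver.extraClassPath     /opt/libs/metamapapi.jar
    spark.executor.extraClassPath   /opt/libs/metamapapi.jar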

For more information, refer to the Spark documentation on submitting applications.
