 

How to specify multiple dependencies using --packages for spark-submit?


I have the following as the command line to start a spark streaming job.

    spark-submit --class com.biz.test \
        --packages \
            org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
            org.apache.hbase:hbase-common:1.0.0 \
            org.apache.hbase:hbase-client:1.0.0 \
            org.apache.hbase:hbase-server:1.0.0 \
            org.json4s:json4s-jackson:3.2.11 \
        ./test-spark_2.10-1.0.8.jar \
        >spark_log 2>&1 &

The job fails to start with the following error:

    Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
        at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
        at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
        at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I've tried removing the formatting and putting everything back on a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, adding _2.10 to the end of the artifactId, and so on.

According to the docs (spark-submit --help):

The format for the coordinates should be groupId:artifactId:version.

So what I have should be valid and should reference this package.

If it helps, I'm running Cloudera 5.4.4.

What am I doing wrong? How can I reference the hbase packages correctly?

asked Nov 25 '15 by davidpricedev


People also ask

How do I pass multiple jars in spark shell?

Just use the --jars parameter. Spark will share those jars (comma-separated) with the executors.
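
For instance, a minimal sketch of passing two local JARs (the file paths here are hypothetical):

    spark-submit --jars /opt/libs/hbase-common-1.0.0.jar,/opt/libs/hbase-client-1.0.0.jar \
        --class com.biz.test ./test-spark_2.10-1.0.8.jar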

How do I add a dependency to spark?

The most common method of including an additional dependency is the --packages argument of the spark-submit command. An example of --packages usage is shown below. Note that the Apache Spark version in your build file must match the Spark version in your cluster.
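
For instance, a minimal sketch using one of the Maven coordinates from the question:

    spark-submit --packages org.json4s:json4s-jackson:3.2.11 \
        --class com.biz.test ./test-spark_2.10-1.0.8.jar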

How do we submit jar files in spark?

One is set through spark-submit (--jars) and one via code (the spark.jars configuration). Choose whichever suits you better. One important thing to note is that using either of these options does not add the JAR files to your driver/executor classpath; you'll need to add them explicitly using the extraClassPath configuration on both.
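
A sketch of doing both at once, assuming a hypothetical /opt/libs/dep.jar present at the same path on every node:

    spark-submit --jars /opt/libs/dep.jar \
        --conf spark.driver.extraClassPath=/opt/libs/dep.jar \
        --conf spark.executor.extraClassPath=/opt/libs/dep.jar \
        --class com.biz.test ./test-spark_2.10-1.0.8.jar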

How do you add an external jar in spark submit from a local maven repository?

To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, use a comma to separate them. The following is an example:

    spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...


2 Answers

A list of packages should be separated by commas without whitespace (breaking lines should work just fine), for example:

    --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
    org.apache.hbase:hbase-common:1.0.0
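
Applied to the command from the question, the corrected invocation would look like this (same coordinates, now comma-separated; note that continuation lines after ,\ must not be indented, or the shell will split the argument):

    spark-submit --class com.biz.test \
        --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
    org.apache.hbase:hbase-common:1.0.0,\
    org.apache.hbase:hbase-client:1.0.0,\
    org.apache.hbase:hbase-server:1.0.0,\
    org.json4s:json4s-jackson:3.2.11 \
        ./test-spark_2.10-1.0.8.jar \
        >spark_log 2>&1 &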
answered by zero323


In Spark 3.0.0 I found it useful to set the packages through the SparkSession builder, here for MySQL and Postgres:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('mysql-postgres') \
        .config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16') \
        .getOrCreate()
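
The session can then load tables over JDBC, for example; the URL, table name, and credentials below are hypothetical placeholders:

    # Hypothetical connection details, for illustration only.
    df = spark.read.format('jdbc') \
        .option('url', 'jdbc:mysql://localhost:3306/testdb') \
        .option('dbtable', 'users') \
        .option('user', 'root') \
        .option('password', 'secret') \
        .load()
    df.show()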
answered by Mohammad Aqajani