I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
    --packages \
        org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
        org.apache.hbase:hbase-common:1.0.0 \
        org.apache.hbase:hbase-client:1.0.0 \
        org.apache.hbase:hbase-server:1.0.0 \
        org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar \
    >spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
    at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
    at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
    at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and putting everything back on a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, adding _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
Just use the --jars parameter. Spark will share those jars (comma-separated) with the executors.
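For example, a minimal sketch assuming the HBase and json4s jars have already been downloaded to local paths (the /opt/jars/ paths below are hypothetical):

spark-submit --class com.biz.test \
    --jars /opt/jars/hbase-common-1.0.0.jar,/opt/jars/hbase-client-1.0.0.jar,/opt/jars/hbase-server-1.0.0.jar,/opt/jars/json4s-jackson_2.10-3.2.11.jar \
    ./test-spark_2.10-1.0.8.jar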
The most common method to include an additional dependency like this is the --packages argument of the spark-submit command; an example of --packages usage is shown below. The Apache Spark versions in your build file must match the Spark version running on your cluster.
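As a rough sketch, a single package from the question could be pulled in like this (coordinates are resolved via Ivy, from the local cache or Maven Central by default):

spark-submit --class com.biz.test \
    --packages org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar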
One is set through spark-submit and the other via code; choose the one that suits you better. One important thing to note is that using either of these options does not add the JAR files to your driver/executor classpath; you'll need to add them explicitly using the extraClassPath configuration on both.
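A sketch of what the extraClassPath note means in practice; the path /opt/jars/hbase-client-1.0.0.jar is hypothetical and must exist on both the driver and executor machines:

spark-submit --class com.biz.test \
    --jars /opt/jars/hbase-client-1.0.0.jar \
    --conf spark.driver.extraClassPath=/opt/jars/hbase-client-1.0.0.jar \
    --conf spark.executor.extraClassPath=/opt/jars/hbase-client-1.0.0.jar \
    ./test-spark_2.10-1.0.8.jar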
Use the --jars option. To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, separate them with commas. The following is an example:

spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
A list of packages should be separated using commas without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
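Applied to the command in the question, that would look something like the sketch below (note that the continuation lines inside the --packages value start in the first column, so the shell joins them into one comma-separated argument):

spark-submit --class com.biz.test \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0,\
org.apache.hbase:hbase-client:1.0.0,\
org.apache.hbase:hbase-server:1.0.0,\
org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar \
    >spark_log 2>&1 &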
I found it worthwhile to use SparkSession in Spark 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('mysql-postgres') \
    .config('spark.jars.packages',
            'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16') \
    .getOrCreate()
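The same packages can also be supplied on the command line instead of in code; assuming the script above is saved as mysql_postgres_job.py (a hypothetical name), spark.jars.packages is equivalent to the --packages flag:

spark-submit \
    --packages mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16 \
    mysql_postgres_job.py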