I have the following as the command line to start a spark streaming job.
spark-submit --class com.biz.test \
    --packages \
        org.apache.spark:spark-streaming-kafka_2.10:1.3.0 \
        org.apache.hbase:hbase-common:1.0.0 \
        org.apache.hbase:hbase-client:1.0.0 \
        org.apache.hbase:hbase-server:1.0.0 \
        org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar \
    >spark_log 2>&1 &
The job fails to start with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Given path is malformed: org.apache.hbase:hbase-common:1.0.0
    at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1665)
    at org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:432)
    at org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:288)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:87)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried removing the formatting and putting everything back on a single line, but that doesn't resolve the issue. I've also tried a bunch of variations: different versions, adding _2.10 to the end of the artifactId, etc.
According to the docs (spark-submit --help):
The format for the coordinates should be groupId:artifactId:version.
So what I have should be valid and should reference this package.
If it helps, I'm running Cloudera 5.4.4.
What am I doing wrong? How can I reference the hbase packages correctly?
Just use the --jars parameter. Spark will share those jars (comma-separated) with the executors.
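For example, a minimal sketch assuming the HBase and json4s jars have already been downloaded to local paths (the /opt/jars/ paths below are hypothetical):

spark-submit --class com.biz.test \
    --jars /opt/jars/hbase-common-1.0.0.jar,/opt/jars/hbase-client-1.0.0.jar,/opt/jars/hbase-server-1.0.0.jar,/opt/jars/json4s-jackson_2.10-3.2.11.jar \
    ./test-spark_2.10-1.0.8.jar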
The most common method to include an additional dependency like this is the --packages argument of the spark-submit command; an example of --packages usage is shown below. The Apache Spark versions in your build file must match the Spark version running on your cluster.
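As a rough sketch, a single package from the question could be pulled in like this (coordinates are resolved via Ivy, from the local cache or Maven Central by default):

spark-submit --class com.biz.test \
    --packages org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar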
One is set through spark-submit and the other via code; choose the one that suits you better. One important thing to note is that using either of these options does not add the JAR files to your driver/executor classpath; you'll need to add them explicitly using the extraClassPath configuration on both.
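A sketch of what the extraClassPath note means in practice; the path /opt/jars/hbase-client-1.0.0.jar is hypothetical and must exist on both the driver and executor machines:

spark-submit --class com.biz.test \
    --jars /opt/jars/hbase-client-1.0.0.jar \
    --conf spark.driver.extraClassPath=/opt/jars/hbase-client-1.0.0.jar \
    --conf spark.executor.extraClassPath=/opt/jars/hbase-client-1.0.0.jar \
    ./test-spark_2.10-1.0.8.jar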
Use the --jars option. To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, separate them with commas. The following is an example:

spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
A list of packages should be separated using commas without whitespace (breaking lines should work just fine), for example:
--packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0
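Applied to the command in the question, that would look something like the sketch below (note that the continuation lines inside the --packages value start in the first column, so the shell joins them into one comma-separated argument):

spark-submit --class com.biz.test \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.3.0,\
org.apache.hbase:hbase-common:1.0.0,\
org.apache.hbase:hbase-client:1.0.0,\
org.apache.hbase:hbase-server:1.0.0,\
org.json4s:json4s-jackson:3.2.11 \
    ./test-spark_2.10-1.0.8.jar \
    >spark_log 2>&1 &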
I found it worthwhile to use SparkSession in Spark 3.0.0 for MySQL and Postgres:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('mysql-postgres') \
    .config('spark.jars.packages',
            'mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16') \
    .getOrCreate()
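The same packages can also be supplied on the command line instead of in code; assuming the script above is saved as mysql_postgres_job.py (a hypothetical name), spark.jars.packages is equivalent to the --packages flag:

spark-submit \
    --packages mysql:mysql-connector-java:8.0.20,org.postgresql:postgresql:42.2.16 \
    mysql_postgres_job.py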