I am using Python 3.5 and Spark 2.2 Streaming with Kafka, and my script failed to run because of missing Kafka libraries. I am puzzled why the library was missing/not found even though the dependency information came from Spark's own website:
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0
I ran "spark-submit script.py" and the error shows that kafka library is required.
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.0 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.0.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
On the next run, I ran "spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10:2.2.0 script.py" so that the Kafka library would be downloaded.
This time the error shows that Ivy is unable to find/download the library:
Ivy Default Cache set to: C:\Users\james\.ivy2\cache
The jars for the packages stored in: C:\Users\james\.ivy2\jars
:: loading settings :: url = jar:file:/D:/programs/spark-2.2.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 2908ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: org.apache.spark#spark-streaming-kafka-0-10;2.2.0
==== local-m2-cache: tried
file:/C:/Users/james/.m2/repository/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom
-- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:
file:/C:/Users/james/.m2/repository/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar
==== local-ivy-cache: tried
C:\Users\james\.ivy2\local\org.apache.spark\spark-streaming-kafka-0-10\2.2.0\ivys\ivy.xml
-- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:
C:\Users\james\.ivy2\local\org.apache.spark\spark-streaming-kafka-0-10\2.2.0\jars\spark-streaming-kafka-0-10.jar
==== central: tried
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom
-- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom
-- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:
http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.apache.spark#spark-streaming-kafka-0-10;2.2.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.spark#spark-streaming-kafka-0-10;2.2.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
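For reference, the direct-stream pattern looks roughly like the sketch below in PySpark. Note that the 0-10 integration exposes no Python API in Spark 2.2, so this sketch falls back to the 0-8 integration; the broker address and topic name are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # 0-8 integration; 0-10 is Scala/Java only

sc = SparkContext(appName="DirectStreamSketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Direct stream: one Spark partition per Kafka partition, offsets managed by Spark
stream = KafkaUtils.createDirectStream(
    ssc,
    ["some-topic"],                              # placeholder topic name
    {"metadata.broker.list": "localhost:9092"})  # placeholder broker address

stream.map(lambda kv: kv[1]).pprint()  # records arrive as (key, value) pairs

ssc.start()
ssc.awaitTermination()

This would be submitted with the matching 0-8 package, e.g. --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0.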
To read from Kafka in a streaming query with Structured Streaming, use SparkSession.readStream. The Kafka server addresses and topic names are required; Spark can subscribe to one or more topics, and a pattern can be used to match multiple topic names.
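A minimal sketch of such a streaming query, assuming the matching spark-sql-kafka-0-10 connector is on the classpath and a broker at localhost:9092 (both placeholders):

from pyspark.sql import SparkSession

# Needs the Kafka SQL connector on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 script.py
spark = SparkSession.builder.appName("kafka-structured-read").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "topic1,topic2")  # or .option("subscribePattern", "topic.*")
      .load())

# Kafka delivers key and value as binary columns; cast them before printing
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()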
First: as discussed on the developers mailing list, Kafka is not included in the binary distribution. That is why you don't have it on the classpath.
Second: in your --packages
command you must specify the Scala version. Omitting it is possible only in SBT, where %% appends the suffix automatically; spark-submit
resolves packages with Ivy in the background, and Ivy needs the full artifact name.
So, please try:
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 script.py
Extra point: maybe I will create a PR to change the error message's wording, since it is misleading.
Try writing
bin/spark-submit --jars yourjarfile.jar --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 pythoncode.py
I had the same problem and solved it by submitting like this. I hope that helps.