
Spark unable to download kafka library

I am using Python 3.5 and Spark 2.2 Streaming with Kafka, and my script fails to run because of missing Kafka libraries.

I am puzzled as to why the library is missing/not found, given that the dependency information comes straight from Spark's own website:

groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0

I ran "spark-submit script.py" and the error shows that kafka library is required.

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.0 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.0.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

On the next run, I ran "spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10:2.2.0 script.py", expecting the Kafka library to be downloaded.

This time the error shows that it is unable to find or download the library:

Ivy Default Cache set to: C:\Users\james\.ivy2\cache
The jars for the packages stored in: C:\Users\james\.ivy2\jars
:: loading settings :: url = jar:file:/D:/programs/spark-2.2.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.spark#spark-streaming-kafka-0-10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
:: resolution report :: resolve 2908ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: org.apache.spark#spark-streaming-kafka-0-10;2.2.0

        ==== local-m2-cache: tried

          file:/C:/Users/james/.m2/repository/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom

          -- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:

          file:/C:/Users/james/.m2/repository/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar

        ==== local-ivy-cache: tried

          C:\Users\james\.ivy2\local\org.apache.spark\spark-streaming-kafka-0-10\2.2.0\ivys\ivy.xml

          -- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:

          C:\Users\james\.ivy2\local\org.apache.spark\spark-streaming-kafka-0-10\2.2.0\jars\spark-streaming-kafka-0-10.jar

        ==== central: tried

          https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom

          -- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:

          https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar

        ==== spark-packages: tried

          http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.pom

          -- artifact org.apache.spark#spark-streaming-kafka-0-10;2.2.0!spark-streaming-kafka-0-10.jar:

          http://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-streaming-kafka-0-10/2.2.0/spark-streaming-kafka-0-10-2.2.0.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: org.apache.spark#spark-streaming-kafka-0-10;2.2.0: not found

                ::::::::::::::::::::::::::::::::::::::::::::::



:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.spark#spark-streaming-kafka-0-10;2.2.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Asked by ilovetolearn on Aug 27 '18

People also ask

Can Kafka and Spark be used together?

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.

How do I transfer data from Kafka to Spark?

Submitting to Spark: start the Kafka Producer CLI, create a new topic called my-first-topic, and provide some sample messages. Then run the spark-submit command to submit the application to the Spark console; a rough sketch of the commands follows below.
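As a rough illustration (not part of the original page), the command sequence might look like this; the Kafka installation paths, the broker and ZooKeeper addresses, and the choice of integration package are assumptions:

    # Create a topic and feed it some sample messages (hosts/paths are placeholders).
    $ bin/kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 1 --partitions 1 --topic my-first-topic
    $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic

    # Submit the Python streaming application with the Kafka integration package.
    $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 script.py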

How do you read data from a Kafka topic in Spark?

To read from Kafka for streaming queries, we can use SparkSession.readStream. The Kafka server addresses and topic names are required. Spark can subscribe to one or more topics, and wildcards can be used to match multiple topic names; a minimal sketch follows below.
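As a minimal sketch (not from the original answers), a streaming read with SparkSession.readStream might look like the following; it assumes the spark-sql-kafka-0-10_2.11:2.2.0 package is passed via --packages, and the broker address and topic name are placeholders:

    from pyspark.sql import SparkSession

    # Placeholder application name; adjust for your cluster.
    spark = SparkSession.builder.appName("kafka-read-sketch").getOrCreate()

    # Subscribe to one topic; "subscribe" also accepts a comma-separated list,
    # and "subscribePattern" accepts a wildcard pattern.
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-first-topic")
          .load())

    # Kafka keys and values arrive as binary; cast them to strings before use.
    messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # Print each micro-batch to the console until interrupted.
    query = messages.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()

Note that this is the Structured Streaming path, backed by the spark-sql-kafka package rather than the spark-streaming-kafka DStream packages discussed in the question.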


2 Answers

First: as discussed on the developers mailing list, Kafka is not included in the binary distribution. That is why you don't have it on the classpath.

Second: in your --packages argument, you should specify the Scala version. It can be omitted only in SBT (which appends it for you via %%); spark-submit uses Ivy in the background, so the full artifact ID is required.

So, please try:

  $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 script.py

Extra point: maybe I will create a PR to change the description in the error message, since it's misleading.

Answered by T. Gawęda on Sep 21 '22


Try running:

bin/spark-submit --jars yourjarfile.jar --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.3 pythoncode.py

I had the same problem and solved it by running the command this way. I hope that helps.
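For completeness, here is a minimal sketch (not from the original answer) of what pythoncode.py might contain when using the 0-8 DStream API, which is the Kafka integration that exposes a Python API; the topic name and broker address are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # Placeholder app name; 10-second micro-batches.
    sc = SparkContext(appName="kafka-dstream-sketch")
    ssc = StreamingContext(sc, 10)

    # Receiver-less direct stream against a placeholder topic and broker.
    stream = KafkaUtils.createDirectStream(
        ssc, ["my-first-topic"], {"metadata.broker.list": "localhost:9092"})

    # Each record is a (key, value) pair; keep only the message payload.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()

This matches the 0-8 assembly used in the command above; the 0-10 integration from the first answer exposes only Scala/Java APIs, so a Python DStream script needs the 0-8 package.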

Answered by Nijat Mursali on Sep 19 '22