I am working on Kafka streaming and trying to integrate it with Apache Spark. However, when I run it I get the error below.
This is the command I am using.
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
ERROR:
Py4JJavaError: An error occurred while calling o77.load.: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
How can I resolve this?
NOTE: I am running this in a Jupyter Notebook.
findspark.init('/home/karan/spark-2.1.0-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
Spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
Everything runs fine up to this point (the code above).
df_TR = Spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "taxirides").load()
This is where things go wrong (the code above).
The blog which I am following: https://www.adaltas.com/en/2019/04/18/spark-streaming-data-pipelines-with-structured-streaming/
Kafka Data Source provides a streaming source and a streaming sink for micro-batch and continuous stream processing. Kafka Data Source is part of the spark-sql-kafka-0-10 external module that is distributed with the official distribution of Apache Spark, but it is not included in the CLASSPATH by default.
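To illustrate the sink side, here is a minimal sketch of writing the stream back to Kafka once the spark-sql-kafka-0-10 package is on the classpath; the output topic and checkpoint path are made-up examples, not from the question:

# Sketch only: Kafka as a streaming sink (output topic and checkpoint path are illustrative)
query = (df_TR
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "taxirides-out")
    .option("checkpointLocation", "/tmp/checkpoints/taxirides-out")
    .start())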
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. PySpark is an interface for Apache Spark in Python.
Spark supports authenticating against a Kafka cluster using delegation tokens (introduced in Kafka broker 1.1.0). This way the application can be configured via Spark parameters and may not need a JAAS login configuration (Spark can use Kafka’s dynamic JAAS configuration feature).
Kafka’s own configurations can be set via DataStreamReader.option with a kafka. prefix, e.g., stream.option("kafka.bootstrap.servers", "host:port"). For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.
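For example, a sketch mixing Spark-level options with a kafka.-prefixed consumer property (the security.protocol value is illustrative, not from the question):

# Options without the kafka. prefix are handled by Spark itself,
# options with the prefix are passed straight to the underlying Kafka consumer
df = (Spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "taxirides")
    .option("startingOffsets", "earliest")           # Spark option, no prefix
    .option("kafka.security.protocol", "PLAINTEXT")  # forwarded to the Kafka consumer
    .load())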
Edit
Using spark.jars.packages works better than PYSPARK_SUBMIT_ARGS (see the sketch below).
Ref - PySpark - NoClassDefFoundError: kafka/common/TopicAndPartition
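A minimal sketch of that approach, assuming the Scala 2.11 / Spark 2.1.0 build from the question; the package has to be configured before the SparkSession (and its JVM) is created:

from pyspark.sql import SparkSession

# Resolve the Kafka connector at startup via spark.jars.packages
Spark = (SparkSession.builder
    .appName('KafkaStreaming')
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0')
    .getOrCreate())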
It's not clear how you ran the code. If you keep reading the blog, you will see:
spark-submit \
...
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
sstreaming-spark-out.py
It seems you missed adding the --packages flag.
In Jupyter, you could add this:
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkSession (and its JVM) is created,
# and it has to end with 'pyspark-shell' when Spark is launched from a plain Python kernel
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'

# initialize spark
import pyspark, findspark
findspark.init()
Note: the _2.11:2.4.0 suffix needs to align with your Scala and Spark versions. Based on the question, yours should be Spark 2.1.0 (the spark-2.1.0-bin-hadoop2.7 download is built for Scala 2.11).
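Assuming that build, the corrected line would look like this (still a sketch; adjust the coordinate if your versions differ):

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 pyspark-shell'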