With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as a source for Structured Streaming with pyspark.
The Kafka topic contains JSON messages (pushed with Streamsets Data Collector), but I am not able to read them with the following code:
kafka = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribe", "mytopic") \
    .load()
msg = kafka.selectExpr("CAST(value AS STRING)")
disp = msg.writeStream.outputMode("append").format("console").start()
It generates this error:
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArrayDeserializer
I tried adding the following options to the readStream call:
.option("value.serializer", "org.common.serialization.StringSerializer")
.option("key.serializer", "org.common.serialization.StringSerializer")
but it does not solve the problem.
Any idea? Thank you in advance.
Classic Spark Streaming receives real-time data and divides it into small batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for stream processing. Either way, the APIs are optimized by Spark's Catalyst optimizer and are ultimately translated into RDDs for execution under the hood.
For a compiled Scala/Java application, the Spark-Kafka integration depends on the Spark, Spark Streaming, and Spark-Kafka integration jars. Create a build.sbt file specifying the application details and its dependencies; sbt then downloads the necessary jars while compiling and packaging the application. With pyspark there is no build.sbt, so the dependency has to be supplied at launch instead, as sketched below.
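Since the question uses pyspark rather than a compiled application, here is a minimal sketch of the pyspark equivalent. It assumes Spark 2.1.0 built against Scala 2.11 and access to Maven Central; the coordinates are those of the Structured Streaming Kafka source package, and the environment variable must be set before the SparkSession is created:

import os

# Assumption: Spark 2.1.0 on Scala 2.11; adjust the coordinates to match
# your build. --packages makes Spark fetch the jar and its transitive
# Kafka client dependencies from Maven Central at launch time.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 pyspark-shell"
)

from pyspark.sql import SparkSession

# The variable must be set before the JVM starts, i.e. before the first
# SparkSession/SparkContext is created in this Python process.
spark = SparkSession.builder.appName("kafka-json-stream").getOrCreate()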
To read from Kafka in a streaming query, use SparkSession.readStream with the Kafka source. The Kafka bootstrap server addresses and topic names are required. Spark can subscribe to one or more topics, and a regex pattern can be used to match multiple topic names, just as with a batch Kafka query.
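As a minimal sketch of such a streaming read, tying it back to the JSON messages from the question (it assumes a SparkSession named spark as above, and the two-field schema here is a placeholder that must match what Streamsets actually writes):

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical payload schema: replace "id" and "status" with the fields
# the Streamsets pipeline actually produces.
schema = StructType([
    StructField("id", StringType()),
    StructField("status", StringType()),
])

raw = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:6667") \
    .option("subscribePattern", "mytopic.*") \
    .load()

# The Kafka source exposes binary key/value columns; cast value to a
# string, then parse the JSON payload into typed columns.
parsed = raw.selectExpr("CAST(value AS STRING) AS json") \
    .select(from_json(col("json"), schema).alias("data")) \
    .select("data.*")

query = parsed.writeStream.outputMode("append").format("console").start()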
Actually I found the solution: I added the following jar as a dependency:
spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar
(after downloading it from https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10-assembly_2.10/2.1.0)
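For pyspark, one way to supply that jar (an assumption about the launch setup, not something stated in the answer) is on the command line, for example spark-submit --jars spark-streaming-kafka-0-10-assembly_2.10-2.1.0.jar app.py, where app.py stands for your streaming script; alternatively, the --packages approach sketched earlier lets Spark resolve the dependency from Maven automatically.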