I am trying to use the Structured Streaming approach in Spark Streaming, based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
The Spark Kafka DataSource has the following underlying schema defined:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and expand the received DataFrame into the columns stored in value. I tried the approach below, but it does not work:
import org.apache.spark.sql.Column

val columns = Array("column1", "column2") // names of the JSON fields
val rawKafkaDF = sparkSession.sqlContext.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe",topic)
.load()
val columnsToSelect = columns.map( x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect:_*)
// some analytics using stream dataframe kafkaDF
val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I get the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337; because at the time the stream is created, the values inside value are not known.
Do you have any suggestions?
From the Spark perspective, value is just a sequence of bytes. Spark has no knowledge of the serialization format or content. To be able to extract a field, you have to parse the value first.
If the data is serialized as a JSON string, you have two options. You can cast value to StringType and use from_json, providing a schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import sparkSession.implicits._ // enables the $"colName" syntax

// replace ??? with the actual field types (e.g. StringType, IntegerType)
val schema: StructType = StructType(Seq(
  StructField("column1", ???),
  StructField("column2", ???)
))

rawKafkaDF.select(from_json($"value".cast(StringType), schema))
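As a minimal sketch, assuming the two hypothetical fields column1 (a string) and column2 (an integer), you could alias the parsed struct and expand it back into top-level columns so the rest of the pipeline can use them directly:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
import sparkSession.implicits._

// assumed field types -- adjust to match your actual JSON payload
val assumedSchema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", IntegerType)
))

val kafkaDF = rawKafkaDF
  .select(from_json($"value".cast(StringType), assumedSchema).alias("data"))
  .select("data.*") // expand the parsed struct into top-level columns

val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()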
Alternatively, cast value to StringType and extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object

val columns: Seq[String] = ??? // names of the JSON fields to extract

// cast value to a string first, then extract each field by its JSON path
val exprs = columns.map(c => get_json_object($"value".cast("string"), s"$$.$c"))
rawKafkaDF.select(exprs: _*)
and cast the extracted values to the desired types later.
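As a usage sketch, again assuming the hypothetical fields column1 (string) and column2 (integer): get_json_object always returns string columns, so the cast is applied afterwards.

import org.apache.spark.sql.functions.get_json_object
import org.apache.spark.sql.types.IntegerType
import sparkSession.implicits._

// extract each field by path, then cast to the target type
val kafkaDF = rawKafkaDF.select(
  get_json_object($"value".cast("string"), "$.column1").alias("column1"),
  get_json_object($"value".cast("string"), "$.column2").cast(IntegerType).alias("column2")
)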