I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11).
Structured Streaming looks really cool, so I wanted to try to migrate the code, but I can't figure out how to use it.
In regular streaming I used KafkaUtils to create a DStream, and in the parameters I passed it the value deserializer.
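Roughly something like this (MyObj stands for my Avro class, MyAvroDeserializer for a custom value deserializer for it, and ssc is the StreamingContext):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[MyAvroDeserializer], // custom Avro deserializer
        "group.id" -> "my-group"
    )

    val stream = KafkaUtils.createDirectStream[String, MyObj](
        ssc,
        PreferConsistent,
        Subscribe[String, MyObj](Array("RED-test-tal4"), kafkaParams)
    )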
In Structured Streaming the docs say that I should deserialize using DataFrame functions, but I can't figure out exactly what that means.
I looked at examples such as this example, but my Avro object in Kafka is quite complex and cannot simply be cast like the String in the example.
So far I have tried this kind of code (which I saw here in a different question):
import spark.implicits._

val ds1 = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "RED-test-tal4")
    .load()

ds1.printSchema()
ds1.select("value").printSchema()

val ds2 = ds1.select($"value".cast(getDfSchemaFromAvroSchema(Obj.getClassSchema)))

val query = ds2.writeStream
    .outputMode("append")
    .format("console")
    .start()
and I get "data type mismatch: cannot cast BinaryType to StructType(StructField(...."
How can I deserialize the value?
Kafka is a messaging and integration platform for Spark Streaming: it acts as the central hub for real-time streams of data, which are then processed with complex algorithms in Spark Streaming.
Kafka handles events as they unfold, so it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, dividing the incoming stream into small batches for processing.
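To make that concrete, the micro-batch interval is configured with a trigger on the streaming query. A minimal sketch (Spark 2.2+ trigger syntax; df stands in for any streaming DataFrame):

    import org.apache.spark.sql.streaming.Trigger

    val q = df.writeStream
        .format("console")
        .trigger(Trigger.ProcessingTime("10 seconds")) // one micro-batch every 10 seconds
        .start()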
As noted above, as of Spark 2.1.0 there is support for Avro with the batch reader, but not with SparkSession.readStream().
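That batch path looks roughly like this (a sketch; it assumes the com.databricks:spark-avro package is on the classpath, and the path is a placeholder):

    // Batch read of Avro files -- the same format is not accepted by readStream in 2.1
    val batchDf = spark.read
        .format("com.databricks.spark.avro")
        .load("/path/to/data.avro")

Here is how I got it to work in Scala based on the other responses. I've simplified the schema for brevity.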
package com.sevone.sparkscala.mypackage

import org.apache.spark.sql._
import org.apache.avro.io.DecoderFactory
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

object MyMain {

    // Case class that the decoded Avro records are mapped into
    case class KafkaMessage (
        deviceId: Int,
        deviceName: String
    )

    // Avro schema matching the case class
    // (note: no trailing comma after the last field -- the schema parser rejects it)
    val schemaString = """{
        "fields": [
            { "name": "deviceId", "type": "int"},
            { "name": "deviceName", "type": "string"}
        ],
        "name": "kafkamsg",
        "type": "record"
    }"""

    // Create Avro schema and reader
    val messageSchema = new Schema.Parser().parse(schemaString)
    val reader = new GenericDatumReader[GenericRecord](messageSchema)

    // Factory to deserialize binary Avro data
    val avroDecoderFactory = DecoderFactory.get()

    // Register implicit encoder for the map operation
    implicit val encoder: Encoder[GenericRecord] = org.apache.spark.sql.Encoders.kryo[GenericRecord]

    def main(args: Array[String]) {
        val KafkaBroker = args(0)
        val InTopic = args(1)
        val OutTopic = args(2) // used in the Kafka-sink sketch below

        // Get Spark session
        val session = SparkSession
            .builder
            .master("local[*]")
            .appName("myapp")
            .getOrCreate()

        // Load streaming data and decode each Kafka value from binary Avro
        import session.implicits._

        val data = session
            .readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", KafkaBroker)
            .option("subscribe", InTopic)
            .load()
            .select($"value".as[Array[Byte]])
            .map(d => {
                val rec = reader.read(null, avroDecoderFactory.binaryDecoder(d, null))
                val deviceId = rec.get("deviceId").asInstanceOf[Int]
                val deviceName = rec.get("deviceName").asInstanceOf[org.apache.avro.util.Utf8].toString
                KafkaMessage(deviceId, deviceName)
            })

        // Print the decoded messages to the console
        val query = data.writeStream
            .outputMode("append")
            .format("console")
            .start()
        query.awaitTermination()
    }
}
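If you want to push the decoded stream back out to OutTopic rather than the console, here is a rough, untested sketch of that variant, replacing the console writer inside main. Note that the Kafka sink expects a "value" column and a checkpoint location, and it only shipped with Spark 2.2, so it won't run on 2.1:

    import org.apache.spark.sql.functions.{struct, to_json}

    // Serialize each KafkaMessage to a JSON string in a "value" column and write to Kafka
    val toKafka = data
        .select(to_json(struct($"deviceId", $"deviceName")).as("value"))
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", KafkaBroker)
        .option("topic", OutTopic)
        .option("checkpointLocation", "/tmp/checkpoints/mymain") // assumption: any writable path
        .start()
    toKafka.awaitTermination()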