I am using Spark Streaming to process data between two Kafka queues, but I cannot seem to find a good way to write to Kafka from Spark. I have tried this:
input.foreachRDD(rdd =>
  rdd.foreachPartition(partition =>
    partition.foreach {
      case x: String => {
        val props = new HashMap[String, Object]()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")

        println(x)
        val producer = new KafkaProducer[String, String](props)
        val message = new ProducerRecord[String, String]("output", null, x)
        producer.send(message)
      }
    }
  )
)
and it works as intended, but instantiating a new KafkaProducer for every message is clearly infeasible in a real context, and I'm trying to work around it.
I would like to keep a reference to a single instance for every process and access it when I need to send a message. How can I write to Kafka from Spark Streaming?
Yes, unfortunately Spark (1.x, 2.x) doesn't make it straightforward to write to Kafka in an efficient manner.
I'd suggest the following approach: use (and re-use) one KafkaProducer instance per executor process/JVM.

Here's the high-level setup for this approach. You "wrap" the KafkaProducer because, as you mentioned, it is not serializable; wrapping it allows you to "ship" it to the executors. The key idea is to use a lazy val so that you delay instantiating the producer until its first use, which is effectively a workaround so that you don't need to worry about KafkaProducer not being serializable. You then hand the wrapped producer to each executor through a broadcast variable and use it from your processing logic to write the results back to Kafka.

The code snippets below work with Spark Streaming as of Spark 2.0.
Step 1: Wrapping KafkaProducer
import java.util.concurrent.Future

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

class MySparkKafkaProducer[K, V](createProducer: () => KafkaProducer[K, V]) extends Serializable {

  /* This is the key idea that allows us to work around running into
     NotSerializableExceptions. */
  lazy val producer = createProducer()

  def send(topic: String, key: K, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, key, value))

  def send(topic: String, value: V): Future[RecordMetadata] =
    producer.send(new ProducerRecord[K, V](topic, value))

}

object MySparkKafkaProducer {

  import scala.collection.JavaConversions._

  def apply[K, V](config: Map[String, Object]): MySparkKafkaProducer[K, V] = {
    val createProducerFunc = () => {
      val producer = new KafkaProducer[K, V](config)

      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }

      producer
    }
    new MySparkKafkaProducer(createProducerFunc)
  }

  def apply[K, V](config: java.util.Properties): MySparkKafkaProducer[K, V] = apply(config.toMap)

}
Step 2: Use a broadcast variable to give each executor its own wrapped KafkaProducer instance
import java.util.Properties

import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc: StreamingContext = {
  val sparkConf = new SparkConf().setAppName("spark-streaming-kafka-example").setMaster("local[2]")
  new StreamingContext(sparkConf, Seconds(1))
}

ssc.checkpoint("checkpoint-directory")

val kafkaProducer: Broadcast[MySparkKafkaProducer[Array[Byte], String]] = {
  val kafkaProducerConfig = {
    val p = new Properties()
    p.setProperty("bootstrap.servers", "broker1:9092")
    p.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
    p.setProperty("value.serializer", classOf[StringSerializer].getName)
    p
  }
  ssc.sparkContext.broadcast(MySparkKafkaProducer[Array[Byte], String](kafkaProducerConfig))
}
Step 3: Write from Spark Streaming to Kafka, re-using the same wrapped KafkaProducer instance (for each executor)
import java.util.concurrent.Future

import org.apache.kafka.clients.producer.RecordMetadata
import org.apache.spark.streaming.dstream.DStream

val stream: DStream[String] = ???

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val metadata: Stream[Future[RecordMetadata]] = partitionOfRecords.map { record =>
      kafkaProducer.value.send("my-output-topic", record)
    }.toStream
    // Force evaluation of the lazy Stream and block until Kafka acknowledges each send.
    metadata.foreach { metadata => metadata.get() }
  }
}
Hope this helps.
My first advice would be to try to create a new instance in foreachPartition and measure whether that is fast enough for your needs (instantiating heavy objects in foreachPartition is what the official documentation suggests).
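For reference, a minimal sketch of that per-partition variant is below; the broker address, topic name, and String serializers are placeholder assumptions, not taken from the question:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

val stream: DStream[String] = ???

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One producer per partition (per task), not one per record.
    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker1:9092")
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      partition.foreach { record =>
        producer.send(new ProducerRecord[String, String]("output", record))
      }
    } finally {
      // close() flushes any buffered records before releasing resources.
      producer.close()
    }
  }
}

This amortizes the producer's setup cost over a whole partition rather than a single record; whether that is enough depends on your batch interval and partition count.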
Another option is to use an object pool as illustrated in this example:
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala
However, I found it hard to implement when using checkpointing.
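To make the pool idea more concrete, here is a rough sketch using Apache Commons Pool 2; it is not the linked example's exact code, and the class names, broker address, and topic are illustrative only:

import java.util.Properties

import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Tells the pool how to create, wrap, and destroy producers.
class PooledProducerFactory(config: Properties)
    extends BasePooledObjectFactory[KafkaProducer[String, String]] {

  override def create(): KafkaProducer[String, String] =
    new KafkaProducer[String, String](config)

  override def wrap(producer: KafkaProducer[String, String]): PooledObject[KafkaProducer[String, String]] =
    new DefaultPooledObject(producer)

  override def destroyObject(p: PooledObject[KafkaProducer[String, String]]): Unit =
    p.getObject.close()
}

// The pool is not serializable either, so it is created lazily inside a
// singleton object that each executor JVM initializes on first use.
object ProducerPool {
  lazy val pool: GenericObjectPool[KafkaProducer[String, String]] = {
    val config = new Properties()
    config.setProperty("bootstrap.servers", "broker1:9092")
    config.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    config.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new GenericObjectPool(new PooledProducerFactory(config))
  }
}

val stream: DStream[String] = ???

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Borrow a producer for the whole partition and always return it.
    val producer = ProducerPool.pool.borrowObject()
    try {
      partition.foreach(record => producer.send(new ProducerRecord[String, String]("output", record)))
    } finally {
      ProducerPool.pool.returnObject(producer)
    }
  }
}

Compared with the single broadcast instance above, a pool lets concurrent tasks on an executor use separate producers, at the price of managing the pool's lifecycle (which, as noted, interacts awkwardly with checkpointing).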
Another version that is working well for me is a factory as described in the following blog post; you just have to check whether it provides enough parallelism for your needs (see the comments section):
http://allegro.tech/2015/08/spark-kafka-integration.html
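The gist of that factory pattern is roughly the following; this is an illustrative sketch in the spirit of the post, not its exact code, and all names here are hypothetical:

import java.util.Properties

import scala.collection.concurrent.TrieMap

import org.apache.kafka.clients.producer.KafkaProducer

// A JVM-wide cache: every task running in the same executor gets the same
// producer for a given configuration, created lazily on first use.
object KafkaProducerCache {

  private val producers = TrieMap.empty[Properties, KafkaProducer[String, String]]

  def getOrCreate(config: Properties): KafkaProducer[String, String] =
    producers.getOrElseUpdate(config, {
      val producer = new KafkaProducer[String, String](config)
      // Flush and release the producer when the executor JVM shuts down.
      sys.addShutdownHook(producer.close())
      producer
    })
}

Tasks call KafkaProducerCache.getOrCreate(config) inside foreachPartition; because the object is initialized on each executor's JVM, nothing has to be serialized, broadcast, or checkpointed.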