I want to write output from Spark and Spark Streaming to Kafka exactly once. But the Spark Streaming documentation says:
"Output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. ".
To make updates transactional, the Spark documentation recommends using the batch time (available in foreachRDD) together with the partition index of the RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. The code listed there is:
import org.apache.spark.TaskContext

dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId)
    // use this uniqueId to transactionally commit the data in partitionIterator
  }
}
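(generateUniqueId is left undefined in the docs; a trivial version, my own sketch rather than anything from the documentation, would just concatenate the two values:)

// Hypothetical helper: any deterministic combination of batch time and
// partition index works, since together they identify one blob of output
// per batch and partition.
def generateUniqueId(batchTimeMs: Long, partitionId: Int): String =
  s"$batchTimeMs-$partitionId"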
But how do I use this uniqueId with Kafka to commit transactionally?
Thanks
An exactly-once solution with Kafka was presented at a Spark Summit by Cody Koeninger, a Senior Software Engineer at Kixer. Essentially, the solution stores the Kafka offsets and the output data in a single transaction, committed together.
When I brought up the exactly-once topic with engineers at a Confluent meetup in 2016, they pointed me to Cody's talk on the subject. Cloudera published his write-up at http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/. His paper is at http://koeninger.github.io/kafka-exactly-once/#1 and his GitHub repo (for this topic) is at https://github.com/koeninger/kafka-exactly-once. There are also videos of his talk that can be found on the web.
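Below is a minimal sketch of that idea, assuming the direct stream from the spark-streaming-kafka-0-10 integration (so the RDD exposes HasOffsetRanges) and a JDBC-compatible transactional store; the table names, connection string, and String-valued records are hypothetical placeholders. The Kafka offset range plays the role of the uniqueId from the question: writing the results and the offset range in the same database transaction means a replayed task either commits exactly once or fails without producing duplicates.

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

dstream.foreachRDD { rdd =>
  // Capture the offset ranges on the driver before shipping work to executors.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { partitionIterator =>
    val osr = offsetRanges(TaskContext.get.partitionId())
    // Hypothetical JDBC connection; any store with real transactions works.
    val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/app")
    conn.setAutoCommit(false)
    try {
      // 1) Write the transformed records (assumes String-valued ConsumerRecords).
      val insertData = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
      partitionIterator.foreach { record =>
        insertData.setString(1, record.value.toString)
        insertData.addBatch()
      }
      insertData.executeBatch()

      // 2) Record this partition's offset range in the SAME transaction.
      //    A unique constraint on (topic, part, until_offset) makes a replayed
      //    task fail here, rolling back its data writes as well.
      val insertOffsets = conn.prepareStatement(
        "INSERT INTO stream_offsets (topic, part, until_offset) VALUES (?, ?, ?)")
      insertOffsets.setString(1, osr.topic)
      insertOffsets.setInt(2, osr.partition)
      insertOffsets.setLong(3, osr.untilOffset)
      insertOffsets.executeUpdate()

      conn.commit() // data and offsets become visible atomically
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    } finally {
      conn.close()
    }
  }
}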
Later versions of Kafka introduce Kafka Streams, which handles the exactly-once scenario without Spark, but that is only worth a footnote here since the question is framed around Spark.
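For completeness, a minimal sketch of that footnote, assuming Kafka 0.11+ and the Kafka Streams API: exactly-once processing is enabled by a single configuration property (the application id and broker address below are placeholders).

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-app")   // hypothetical app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // hypothetical broker
// "exactly_once" turns on transactional producers and read_committed consumption.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)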