Spark output to Kafka exactly-once

I want to write output from Spark and Spark Streaming to Kafka with exactly-once semantics. But the docs say: "Output operations (like foreachRDD) have at-least once semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure."
For transactional updates, the docs recommend using the batch time (available in foreachRDD) and the partition index of the RDD to create an identifier. This identifier uniquely identifies a blob of data in the streaming application. The code from the docs:

import org.apache.spark.TaskContext

dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId)
    // use this uniqueId to transactionally commit the data in partitionIterator
  }
}
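
(The docs never define generateUniqueId; a trivial version might be:)

def generateUniqueId(batchTime: Long, partitionId: Int): String =
  s"$batchTime-$partitionId" // stable across task retries for the same batch/partition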

But how do I use this uniqueId with Kafka to commit the data transactionally?

Thanks

asked Nov 08 '22 by bforevdr
1 Answer

An exactly-once solution with Kafka was discussed at a Spark Summit by Cody Koeninger, a Senior Software Engineer at Kixer. Essentially, the solution is to store the output data and the Kafka offsets together, committing both in a single transaction.
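
A minimal sketch of that pattern, assuming the direct stream (so offset ranges are available), a JDBC store, and a hypothetical saveResult helper; the import paths are for the spark-streaming-kafka-0-10 connector, and the DSN and table are placeholders:

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

dstream.foreachRDD { rdd =>
  // grab the offset ranges on the driver, before foreachPartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { partitionIterator =>
    // for the direct stream, RDD partition i corresponds to offsetRanges(i)
    val osr = offsetRanges(TaskContext.get.partitionId())
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/results") // placeholder DSN
    conn.setAutoCommit(false)
    try {
      partitionIterator.foreach(record => saveResult(conn, record)) // hypothetical helper
      val stmt = conn.prepareStatement(
        "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
      stmt.setLong(1, osr.untilOffset)
      stmt.setString(2, osr.topic)
      stmt.setInt(3, osr.partition)
      stmt.executeUpdate()
      conn.commit() // data and offsets land together, or not at all
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }
  }
}

On restart the job reads its starting offsets back from the same store, so a batch whose data was already committed is never reprocessed.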

When the exactly-once topic came up with engineers at a Confluent meetup in 2016, they referenced Cody's lecture on the subject. Cloudera published his write-up at http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/. His slides are at http://koeninger.github.io/kafka-exactly-once/#1 and his GitHub repo (for this topic) is at https://github.com/koeninger/kafka-exactly-once. There are also videos of his lecture on the web.
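
As for the asker's uniqueId specifically: once the brokers are on Kafka 0.11+, one option is Kafka's own transactional producer. A hedged sketch (this is not Cody's approach; the broker address, topic name, and serializers are assumptions, and uniqueId is assumed to be a String):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// one transactional.id per (batch time, partition): the uniqueId from the question
props.put("transactional.id", uniqueId)

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
try {
  producer.beginTransaction()
  partitionIterator.foreach { record =>
    // carrying uniqueId as the key lets downstream consumers deduplicate re-runs
    producer.send(new ProducerRecord("output-topic", uniqueId, record.toString))
  }
  producer.commitTransaction() // consumers with isolation.level=read_committed see all or nothing
} catch {
  case e: Exception => producer.abortTransaction(); throw e
} finally {
  producer.close()
}

Note the transaction gives atomicity per attempt; if a task retries after a successful commit, the repeated uniqueId key is what downstream deduplication would key on.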

Later versions of Kafka introduced Kafka Streams, which handles the exactly-once scenario without Spark, but that is only worth a footnote here since the question is framed around Spark.
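
For completeness, that footnote in sketch form; the settings are real Kafka Streams configs, while the application id and topics are placeholders:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-demo") // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)
// the whole read-process-write cycle becomes exactly-once with one setting
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)

val builder = new StreamsBuilder()
builder.stream[String, String]("input-topic").to("output-topic") // trivial pass-through
new KafkaStreams(builder.build(), props).start()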

answered Nov 14 '22 by codeaperature