How to save the latest offset that Spark consumed to ZK or Kafka so it can be read back after a restart

I am using Kafka 0.8.2 to receive data from AdExchange, and I use Spark Streaming 1.4.1 to store the data in MongoDB.

My problem is that when I restart my Spark Streaming job, for instance to deploy a new version, fix a bug, or add a feature, it continues reading from whatever the latest Kafka offset is at that moment, so I lose the data that AdX pushed to Kafka while the job was restarting.

I tried something like auto.offset.reset -> smallest, but then it consumes everything from offset 0 up to the latest, so the data in the DB is huge and duplicated.

I also tried setting a specific group.id and consumer.id for Spark, but the result is the same (the setup is roughly the one sketched below).
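For reference, a minimal sketch of roughly how such parameters are passed to the Kafka stream; the broker list, group id, and topic name here are placeholders I made up, and streamingContext is assumed to be an existing StreamingContext:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder values -- the real job uses its own brokers, group id and topic.
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "group.id"             -> "adx-consumer",
  "auto.offset.reset"    -> "smallest"   // replays the topic from the beginning, hence the duplicates
)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, Set("adx-events"))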

How can I save the latest offset Spark consumed to ZooKeeper or Kafka, so that after a restart the job can read that offset back and continue from there up to the latest one?

giaosudau asked Aug 06 '15 04:08


2 Answers

One of the overloads of the createDirectStream function takes a map that holds the topic and partition as the key and the offset from which you want to start consuming as the value.

Just look at the API here: http://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html. The map I am talking about is usually called fromOffsets.

You can insert entries into the map:

startOffsetsMap.put(TopicAndPartition(topicName, partitionId), startOffset)

And use it when you create the direct stream (here messageHandler is a function that turns each MessageAndMetadata[String, String] record into whatever value you want in the stream, for example a (key, message) pair):

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
                streamingContext, kafkaParams, startOffsetsMap, messageHandler(_))

After each iteration you can get the processed offsets using:

rdd.asInstanceOf[HasOffsetRanges].offsetRanges

You can use this data to construct the fromOffsets map the next time the job starts; a rough end-to-end sketch follows.
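For illustration, here is one way that loop could look with the offsets kept in ZooKeeper via Apache Curator. The znode layout, connection string, topic name, partition count, and helper names are my own assumptions rather than part of the original answer, and streamingContext / kafkaParams are assumed to exist as in the snippet above:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Assumed znode layout: /adx-offsets/<topic>/<partition> holds the next offset as a string.
val zk = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

def offsetPath(tp: TopicAndPartition) = s"/adx-offsets/${tp.topic}/${tp.partition}"

// Read back whatever offsets were stored on the previous run; partitions with no
// stored offset are simply left out of the map.
def loadOffsets(topic: String, partitions: Seq[Int]): Map[TopicAndPartition, Long] =
  partitions.flatMap { p =>
    val tp = TopicAndPartition(topic, p)
    Option(zk.checkExists().forPath(offsetPath(tp))).map { _ =>
      tp -> new String(zk.getData().forPath(offsetPath(tp)), "UTF-8").toLong
    }
  }.toMap

// Persist the end offset of each processed range.
def saveOffsets(ranges: Array[OffsetRange]): Unit = ranges.foreach { r =>
  val tp   = TopicAndPartition(r.topic, r.partition)
  val data = r.untilOffset.toString.getBytes("UTF-8")
  if (zk.checkExists().forPath(offsetPath(tp)) == null)
    zk.create().creatingParentsIfNeeded().forPath(offsetPath(tp), data)
  else
    zk.setData().forPath(offsetPath(tp), data)
}

// On (re)start: rebuild fromOffsets and hand it to the direct stream.
val startOffsetsMap = loadOffsets("adx-events", 0 until 4)   // topic name / partition count are placeholders
val messageHandler  = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  streamingContext, kafkaParams, startOffsetsMap, messageHandler)

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges   // grab this on the driver, before any shuffle
  // ... write the batch to MongoDB here ...
  saveOffsets(ranges)   // commit the offsets only after the output has succeeded
}

Committing the offsets only after the MongoDB write succeeds gives at-least-once delivery; duplicates on a retried batch then have to be absorbed by making the write idempotent (for example, upserts keyed on a natural id).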

You can see the full code and usage at the end of the page here: https://spark.apache.org/docs/latest/streaming-kafka-integration.html

Michael Kopaniov answered Oct 21 '22 06:10


To add to Michael Kopaniov's answer, if you really want to use ZK as the place you store and load your map of offsets from, you can.

However, because your results are not being output to ZK, you will not get reliable semantics unless your output operation is idempotent (which it sounds like it isn't).

If it's possible to store your results in the same document in Mongo, alongside the offsets, in a single atomic action, that might be better for you (see the sketch below).
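As one illustration of that idea, here is a minimal sketch assuming the batch output for each partition can be folded into a single Mongo document; the host, database, collection, and field names are placeholders, and stream is the direct stream from the previous answer:

import com.mongodb.{BasicDBObject, MongoClient}
import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val batchCount = rdd.count()   // stand-in for your real aggregation of the batch

  // In a real job you would reuse the client rather than open one per batch.
  val mongo = new MongoClient("localhost", 27017)
  val collection = mongo.getDB("adx").getCollection("partition_state")

  ranges.foreach { r =>
    // One document per topic/partition: the result and the offset it covers are
    // written together in a single upsert, so they cannot get out of sync.
    val query = new BasicDBObject("_id", s"${r.topic}-${r.partition}")
    val update = new BasicDBObject("$set",
      new BasicDBObject("untilOffset", r.untilOffset).append("lastBatchCount", batchCount))
    collection.update(query, update, true /* upsert */, false /* multi */)
  }
  mongo.close()
}

On restart you would read untilOffset back from those documents to rebuild the fromOffsets map, instead of (or in addition to) keeping the offsets in ZooKeeper.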

For more detail, see https://www.youtube.com/watch?v=fXnNEq1v3VA

Cody Koeninger answered Oct 21 '22 07:10