I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions, and I also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job from.
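Roughly, what I have in mind is something like the sketch below (broker address, topic name, partitions and offsets are placeholders I would have to fill in and manage myself):

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // One OffsetRange per topic partition; the from/until offsets are exactly the
    // numbers I would have to load from wherever the previous run stored them.
    val offsetRanges = Array(
      OffsetRange("my-topic", partition = 0, fromOffset = 0L, untilOffset = 1000L),
      OffsetRange("my-topic", partition = 1, fromOffset = 0L, untilOffset = 1000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map { case (_, value) => value }.take(10).foreach(println)

    // After processing, the untilOffsets would need to be persisted somewhere
    // (ZK? HDFS?) so that the next batch knows where to resume.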
What is the right approach to read from Kafka in a batch job?
I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest, saves its checkpoint to HDFS, and then starts from that checkpoint in the next run.
But in this case how can I just fetch once and stop streaming after the first batch?
To read from Kafka in a streaming query, we can use SparkSession.readStream. Kafka server addresses and topic names are required. Spark can subscribe to one or more topics, and wildcards can be used to match multiple topic names, similarly to a batch query.
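For reference, readStream belongs to Structured Streaming (Spark 2.0+), so it would not apply to a 1.6 job directly, but a minimal sketch of such a streaming read (assuming the spark-sql-kafka-0-10 package; broker address and topic names are placeholders) could look like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-structured-stream")
      .getOrCreate()

    // Subscribe to a comma-separated list of topics; the "subscribePattern" option
    // takes a regex instead, which is how wildcard-style matching is done.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "topic1,topic2")
      .load()

    // Key and value arrive as binary columns and are usually cast to strings.
    val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")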
If you set the configuration auto.offset.reset in the Kafka parameters to smallest, then it will start consuming from the smallest offset. You can also start consuming from any arbitrary offset using other variations of KafkaUtils.
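With the Spark 1.6 direct API, passing auto.offset.reset=smallest through the Kafka parameters might look like the sketch below (broker and topic names are placeholders); the createDirectStream overload that takes a map of starting offsets is the variation that lets you begin at arbitrary positions:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-stream"), Seconds(10))

    // "smallest" makes the first run (with no stored offsets) begin at the earliest
    // offset still retained by Kafka; "largest" would begin at the tip instead.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset"    -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()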
Accordingly, batch processing can be implemented easily on top of Apache Kafka, the advantages of Kafka are retained, and the operation can be made efficient.
Kafka is a popular messaging and integration platform for Spark Streaming: it acts as the central hub for real-time streams of data, which are then processed with complex algorithms in Spark Streaming.
createRDD is the right approach for reading a batch from Kafka.
To query for info about the latest / earliest available offsets, look at the getLatestLeaderOffsets and getEarliestLeaderOffsets methods in KafkaCluster.scala. That class was private, but should be public in the latest versions of Spark.
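A rough sketch of using it to build the offset ranges for a batch read could look like the following (Spark 1.6 API; error handling of the Either results is skipped and the broker / topic names are placeholders):

    import kafka.common.TopicAndPartition
    import org.apache.spark.streaming.kafka.{KafkaCluster, OffsetRange}

    // Same kafkaParams as for the direct stream / createRDD.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val kc = new KafkaCluster(kafkaParams)

    // Find the topic's partitions, then ask for their earliest and latest offsets.
    val partitions: Set[TopicAndPartition] =
      kc.getPartitions(Set("my-topic")).right.get

    val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
    val latest   = kc.getLatestLeaderOffsets(partitions).right.get

    // One OffsetRange per partition, covering everything currently available;
    // a real job would substitute the offsets saved by the previous run.
    val offsetRanges: Array[OffsetRange] = partitions.toArray.map { tp =>
      OffsetRange(tp.topic, tp.partition, earliest(tp).offset, latest(tp).offset)
    }

The resulting offsetRanges array can then be passed straight to KafkaUtils.createRDD, and the untilOffset values are what you would persist (in ZK, HDFS, or elsewhere) so the next run knows where to resume.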