I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions, and I also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job from.
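Roughly, what I have in mind is something like the sketch below (broker address, topic name, partitions and offsets are placeholders I would have to fill in and manage myself):

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // One OffsetRange per topic partition; the from/until offsets are exactly the
    // numbers I would have to load from wherever the previous run stored them.
    val offsetRanges = Array(
      OffsetRange("my-topic", partition = 0, fromOffset = 0L, untilOffset = 1000L),
      OffsetRange("my-topic", partition = 1, fromOffset = 0L, untilOffset = 1000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map { case (_, value) => value }.take(10).foreach(println)

    // After processing, the untilOffsets would need to be persisted somewhere
    // (ZK? HDFS?) so that the next batch knows where to resume.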
What is the right approach to read from Kafka in a batch job?
I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest, saves its checkpoint to HDFS, and then starts from that checkpoint in the next run.
But in this case how can I just fetch once and stop streaming after the first batch?
To read from Kafka in a streaming query, we can use SparkSession.readStream. Kafka server addresses and topic names are required. Spark can subscribe to one or more topics, and wildcards can be used to match multiple topic names, similarly to a batch query.
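For reference, readStream belongs to Structured Streaming (Spark 2.0+), so it would not apply to a 1.6 job directly, but a minimal sketch of such a streaming read (assuming the spark-sql-kafka-0-10 package; broker address and topic names are placeholders) could look like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-structured-stream")
      .getOrCreate()

    // Subscribe to a comma-separated list of topics; the "subscribePattern" option
    // takes a regex instead, which is how wildcard-style matching is done.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "topic1,topic2")
      .load()

    // Key and value arrive as binary columns and are usually cast to strings.
    val messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")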
If you set the configuration auto.offset.reset in the Kafka parameters to smallest, then it will start consuming from the smallest offset. You can also start consuming from any arbitrary offset using other variations of KafkaUtils.
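With the Spark 1.6 direct API, passing auto.offset.reset=smallest through the Kafka parameters might look like the sketch below (broker and topic names are placeholders); the createDirectStream overload that takes a map of starting offsets is the variation that lets you begin at arbitrary positions:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-stream"), Seconds(10))

    // "smallest" makes the first run (with no stored offsets) begin at the earliest
    // offset still retained by Kafka; "largest" would begin at the tip instead.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset"    -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()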
Accordingly, batch processing can be implemented easily on top of Apache Kafka, the advantages of Kafka are retained, and the operation can be made efficient.
Kafka is a popular messaging and integration platform for Spark Streaming: it acts as the central hub for real-time streams of data, which are then processed with complex algorithms in Spark Streaming.
createRDD is the right approach for reading a batch from Kafka.
To query for info about the latest / earliest available offsets, look at the getLatestLeaderOffsets and getEarliestLeaderOffsets methods in KafkaCluster.scala. That class was private, but should be public in the latest versions of Spark.
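A rough sketch of using it to build the offset ranges for a batch read could look like the following (Spark 1.6 API; error handling of the Either results is skipped and the broker / topic names are placeholders):

    import kafka.common.TopicAndPartition
    import org.apache.spark.streaming.kafka.{KafkaCluster, OffsetRange}

    // Same kafkaParams as for the direct stream / createRDD.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val kc = new KafkaCluster(kafkaParams)

    // Find the topic's partitions, then ask for their earliest and latest offsets.
    val partitions: Set[TopicAndPartition] =
      kc.getPartitions(Set("my-topic")).right.get

    val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
    val latest   = kc.getLatestLeaderOffsets(partitions).right.get

    // One OffsetRange per partition, covering everything currently available;
    // a real job would substitute the offsets saved by the previous run.
    val offsetRanges: Array[OffsetRange] = partitions.toArray.map { tp =>
      OffsetRange(tp.topic, tp.partition, earliest(tp).offset, latest(tp).offset)
    }

The resulting offsetRanges array can then be passed straight to KafkaUtils.createRDD, and the untilOffset values are what you would persist (in ZK, HDFS, or elsewhere) so the next run knows where to resume.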