Read Kafka topic in a Spark batch job

I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions, and I also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job from.

What is the right approach to read from Kafka in a batch job?

I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest, saves the checkpoint to HDFS, and then starts from that checkpoint on the next run.

But in that case, how can I fetch just once and stop streaming after the first batch?

Bruckwald asked Jun 25 '16

People also ask

How do I read Kafka topic from Spark?

To read from Kafka for streaming queries, we can use SparkSession.readStream. Kafka server addresses and topic names are required. Spark can subscribe to one or more topics, and wildcards can be used to match multiple topic names, just as in a batch query.
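A minimal sketch of such a streaming read (this is the Structured Streaming API available from Spark 2.x onward, so it does not apply directly to the 1.6 version in the question; the broker address and topic name below are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-stream-read").getOrCreate()

    // Subscribe to one topic; "subscribePattern" can be used for wildcard matching.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker address
      .option("subscribe", "my-topic")                     // placeholder topic name
      .load()

    // Kafka records arrive as binary key/value columns; cast them as needed.
    val values = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")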

How do you consume data from Kafka topic in Spark streaming?

If you set the configuration auto.offset.reset in the Kafka parameters to smallest, consumption starts from the smallest available offset. You can also start consuming from an arbitrary offset using other variants of KafkaUtils.
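As a sketch of that with the Spark 1.6 / Kafka 0.8 direct stream API (the broker address, topic name, app name, and batch interval are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-stream"), Seconds(30))

    // "smallest" makes the first run start from the earliest available offset.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",  // placeholder broker
      "auto.offset.reset"    -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))         // placeholder topic

    stream.map(_._2).print()                     // values only
    ssc.start()
    ssc.awaitTermination()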

Can we use Kafka for batch processing?

Yes. Batch processing can be implemented easily with Apache Kafka, so its advantages can be retained and the operation kept efficient.

Can Kafka and Spark be used together?

Kafka is a natural messaging and integration platform for Spark Streaming. Kafka acts as the central hub for real-time streams of data, which are then processed with complex algorithms in Spark Streaming.


1 Answer

createRDD is the right approach for reading a batch from Kafka.

To query for the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That class was private, but should be public in the latest versions of Spark.
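A minimal sketch of what that batch read can look like with the Spark 1.6 / Kafka 0.8 API. The broker address, topic name, partition numbers, and offsets below are placeholders; in a real job the fromOffset of each range would come from wherever the previous batch stored its offsets, and the untilOffset from the brokers:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker

    // One OffsetRange per partition: [fromOffset, untilOffset).
    // fromOffset: where the previous batch stopped (loaded from ZK/HDFS/...),
    // untilOffset: the latest available offset on the brokers.
    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 0L, 100L),   // partition 0, placeholder offsets
      OffsetRange("my-topic", 1, 0L, 100L))   // partition 1, placeholder offsets

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map(_._2).take(5).foreach(println)    // values only

    // After processing, persist each range's untilOffset so the next batch
    // knows where to start.

The untilOffset values are hard-coded here only for illustration; as noted above, getLatestLeaderOffsets / getEarliestLeaderOffsets in KafkaCluster.scala are the way to discover them once that class is accessible in your Spark version.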

Cody Koeninger answered Sep 18 '22