I was going through the Spark Structured Streaming + Kafka Integration Guide here. It states that:
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my spark application has successfully processed each record?
Manual commits: You can call commitSync() or commitAsync() on the KafkaConsumer at any time. When you issue the call, the consumer takes the offset of the last message received during a poll() and commits that to the Kafka server.
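A minimal sketch of this pattern with a plain KafkaConsumer (outside of Spark; the broker address, group id, and topic name are placeholder assumptions):

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker
props.put("group.id", "myConsumerGroup")           // assumed group name
props.put("enable.auto.commit", "false")           // take over commit responsibility
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("topic1").asJava)           // assumed topic

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commits the offsets of the last poll() once processing succeeded
}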
With auto commit enabled, the Kafka consumer instead commits offsets periodically while polling batches. That strategy works well if the message processing is synchronous and failures are handled gracefully. Be aware that, starting with Quarkus 1.9, auto commit is disabled by default, so you need to enable it explicitly.
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.
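This distinction matters for committing offsets: the older DStream-based API does support manual commits through its CanCommitOffsets handle, as described in the classic Spark Streaming + Kafka integration guide. A minimal sketch of that pattern (broker address, topic, and group name are placeholder assumptions):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("DStreamCommitSketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // assumed broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "myConsumerGroup",                  // assumed group name
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic1"), kafkaParams))

stream.foreachRDD { rdd =>
  // capture the offset ranges of this batch before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the records of this batch here ...
  // commit back to Kafka only after processing succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()

As the classic guide notes, outputs must still be idempotent, since the commit happens only after processing. In Structured Streaming, however, this mechanism does not exist, as the following answers explain.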
tl;dr
It is not possible to commit any offsets to Kafka. Starting with Spark version 3.x you can define the name of the Kafka consumer group; however, this still does not allow you to commit any offsets.
According to the Structured Streaming + Kafka Integration Guide you can provide the consumer group as the option kafka.group.id:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("kafka.group.id", "myConsumerGroup")
  .load()
However, Spark will still not commit any offsets back, so you will not be able to "manually" commit offsets to Kafka. This option is meant to deal with Kafka's authorization using role-based access control, for which your consumer group usually needs to follow naming conventions.
A full example of a Spark 3.x application is discussed and solved here.
The Spark Structured Streaming + Kafka Integration Guide clearly states how it manages Kafka offsets: Spark will not commit any offsets back to Kafka, as it relies on internal offset management for fault tolerance.
The most important Kafka configurations for managing offsets are:

group.id: By default, the Kafka source creates a unique group id for each query (unless overridden via kafka.group.id in Spark 3.x, as shown above). In the source code it is generated as
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

auto.offset.reset: Cannot be set by the user; set the source option startingOffsets to specify where to start instead.

enable.auto.commit: Cannot be set by the user; the Kafka source doesn't commit any offset.
Therefore, prior to Spark 3.x it was not possible to define a custom group.id for the Kafka consumer at all, and in any version Structured Streaming manages the offsets internally and does not commit them back to Kafka (also not automatically).
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
import org.apache.spark.sql.SparkSession

// create SparkSession
val spark = SparkSession.builder()
  .appName("ListenerTester")
  .master("local[*]")
  .getOrCreate()

// read from Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("failOnDataLoss", "false")
  .load()

// write to Kafka topic and set checkpoint directory for this stream
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "testingKafkaProducerOut")
  .option("checkpointLocation", "/home/.../sparkCheckpoint/")
  .start()
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset of partition 0 to be consumed is 1. It implies that the application has already processed offset 0 of partition 0 of the topic testingKafkaProducer.
More on the fault-tolerance semantics is given in the Spark documentation.
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh tool that ships with the Kafka installation:
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC                 PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID       HOST        CLIENT-ID
testingKafkaProducer  0          -               1               -    consumer-1-[...]  /127.0.0.1  consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Please carefully read this note from Spark committer @JungtaekLim about the workaround described below: "Spark's fault tolerance guarantee is based on the fact Spark has a full control of offset management, and they're voiding the guarantee if they're trying to modify it. (e.g. If they change to commit offset to Kafka, then there's no batch information and if Spark needs to move back to the specific batch "behind" guarantee is no longer valid.)"
What I have seen doing some research on the web is that you could commit offsets in the callback of the onQueryProgress method in a customized StreamingQueryListener of Spark. That way, you could have a consumer group that keeps track of the current progress. However, its progress is not necessarily aligned with the actual consumer group.
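A minimal sketch of that idea, assuming json4s (which ships with Spark) for parsing the progress JSON; the listener name, broker address, and tracking group name are assumptions:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

class OffsetCommitListener extends StreamingQueryListener {

  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker
  props.put("group.id", "myTrackingGroup")         // assumed tracking group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    implicit val formats: Formats = DefaultFormats
    event.progress.sources.foreach { source =>
      // for a Kafka source, endOffset is a JSON string like {"testingKafkaProducer":{"0":1}}
      Option(source.endOffset).foreach { json =>
        val endOffsets = parse(json).extract[Map[String, Map[String, Long]]]
        val toCommit = for {
          (topic, partitions) <- endOffsets
          (partition, offset) <- partitions
        } yield new TopicPartition(topic, partition.toInt) -> new OffsetAndMetadata(offset)

        // a short-lived consumer used only for committing; it never polls
        val consumer = new KafkaConsumer[String, String](props)
        try consumer.commitSync(toCommit.asJava)
        finally consumer.close()
      }
    }
  }
}

Register it before starting the query, e.g. spark.streams.addListener(new OffsetCommitListener). Keep the warning above in mind: this tracking group merely mirrors Spark's progress and plays no part in Spark's own fault tolerance.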
Here are some links you may find helpful:
Code Example for Listener
Discussion on SO around offset management
General description on the StreamingQueryListener