I was going through the Spark Structured Streaming + Kafka Integration Guide here. It states that:
enable.auto.commit: Kafka source doesn’t commit any offset.
So how do I manually commit offsets once my spark application has successfully processed each record?
Manual commits: You can call commitSync() or commitAsync() on the KafkaConsumer at any time. When you issue the call, the consumer takes the offset of the last message received during a poll() and commits that to the Kafka server.
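A minimal sketch of this pattern with a plain KafkaConsumer (outside of Spark; the broker address, group id, and topic name are placeholder assumptions):

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // assumed broker
props.put("group.id", "myConsumerGroup")           // assumed group name
props.put("enable.auto.commit", "false")           // take over commit responsibility
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("topic1").asJava)           // assumed topic

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
  consumer.commitSync() // commits the offsets of the last poll() once processing succeeded
}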
With auto commit enabled, the Kafka consumer instead commits offsets periodically while polling batches. That strategy works well if the message processing is synchronous and failures are handled gracefully. Be aware that, starting with Quarkus 1.9, auto commit is disabled by default, so you need to enable it explicitly.
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Catalyst optimizer and translated into RDDs for execution under the hood.
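This distinction matters for committing offsets: the older DStream-based API does support manual commits through its CanCommitOffsets handle, as described in the classic Spark Streaming + Kafka integration guide. A minimal sketch of that pattern (broker address, topic, and group name are placeholder assumptions):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("DStreamCommitSketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // assumed broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "myConsumerGroup",                  // assumed group name
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic1"), kafkaParams))

stream.foreachRDD { rdd =>
  // capture the offset ranges of this batch before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the records of this batch here ...
  // commit back to Kafka only after processing succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()

As the classic guide notes, outputs must still be idempotent, since the commit happens only after processing. In Structured Streaming, however, this mechanism does not exist, as the following answers explain.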
tl;dr
It is not possible to commit any offsets to Kafka. Starting with Spark version 3.x you can define the name of the Kafka consumer group; however, this still does not allow you to commit any offsets.
According to the Structured Streaming + Kafka Integration Guide you can provide the consumer group as the option kafka.group.id:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("kafka.group.id", "myConsumerGroup")
  .load()
However, Spark will still not commit any offsets back, so you will not be able to "manually" commit offsets to Kafka. This option is meant to deal with Kafka's authorization using role-based access control, for which your consumer group usually needs to follow naming conventions.
A full example of a Spark 3.x application is discussed and solved here.
The Spark Structured Streaming + Kafka Integration Guide clearly states how it manages Kafka offsets: Spark will not commit any offsets back to Kafka, as it relies on internal offset management for fault tolerance.
The most important Kafka configurations for managing offsets are:

group.id: By default, the Kafka source creates a unique group id for each query (unless overridden via kafka.group.id in Spark 3.x, as shown above). In the source code it is generated as
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

auto.offset.reset: Cannot be set by the user; set the source option startingOffsets to specify where to start instead.

enable.auto.commit: Cannot be set by the user; the Kafka source doesn't commit any offset.
Therefore, prior to Spark 3.x it was not possible to define a custom group.id for the Kafka consumer at all, and in any version Structured Streaming manages the offsets internally and does not commit them back to Kafka (also not automatically).
Let's say you have a simple Spark Structured Streaming application that reads and writes to Kafka, like this:
import org.apache.spark.sql.SparkSession

// create SparkSession
val spark = SparkSession.builder()
  .appName("ListenerTester")
  .master("local[*]")
  .getOrCreate()

// read from Kafka topic
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testingKafkaProducer")
  .option("failOnDataLoss", "false")
  .load()

// write to Kafka topic and set checkpoint directory for this stream
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "testingKafkaProducerOut")
  .option("checkpointLocation", "/home/.../sparkCheckpoint/")
  .start()
Once this application is submitted and data is being processed, the corresponding offset can be found in the checkpoint directory:
myCheckpointDir/offsets/
{"testingKafkaProducer":{"0":1}}
Here the entry in the checkpoint file confirms that the next offset of partition 0 to be consumed is 1. It implies that the application has already processed offset 0 of partition 0 of the topic testingKafkaProducer.
More on the fault-tolerance semantics is given in the Spark documentation.
However, as stated in the documentation, the offset is not committed back to Kafka.
This can be checked by executing the kafka-consumer-groups.sh tool that ships with the Kafka installation:
./kafka/current/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "spark-kafka-source-92ea6f85-[...]-driver-0"
TOPIC                 PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID       HOST        CLIENT-ID
testingKafkaProducer  0          -               1               -    consumer-1-[...]  /127.0.0.1  consumer-1
The current offset for this application is unknown to Kafka as it has never been committed.
Please carefully read this note from Spark committer @JungtaekLim about the workaround described below: "Spark's fault tolerance guarantee is based on the fact Spark has a full control of offset management, and they're voiding the guarantee if they're trying to modify it. (e.g. If they change to commit offset to Kafka, then there's no batch information and if Spark needs to move back to the specific batch "behind" guarantee is no longer valid.)"
What I have seen doing some research on the web is that you could commit offsets in the callback of the onQueryProgress method in a customized StreamingQueryListener of Spark. That way, you could have a consumer group that keeps track of the current progress. However, its progress is not necessarily aligned with the actual consumer group.
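A minimal sketch of that idea, assuming json4s (which ships with Spark) for parsing the progress JSON; the listener name, broker address, and tracking group name are assumptions:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

class OffsetCommitListener extends StreamingQueryListener {

  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker
  props.put("group.id", "myTrackingGroup")         // assumed tracking group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    implicit val formats: Formats = DefaultFormats
    event.progress.sources.foreach { source =>
      // for a Kafka source, endOffset is a JSON string like {"testingKafkaProducer":{"0":1}}
      Option(source.endOffset).foreach { json =>
        val endOffsets = parse(json).extract[Map[String, Map[String, Long]]]
        val toCommit = for {
          (topic, partitions) <- endOffsets
          (partition, offset) <- partitions
        } yield new TopicPartition(topic, partition.toInt) -> new OffsetAndMetadata(offset)

        // a short-lived consumer used only for committing; it never polls
        val consumer = new KafkaConsumer[String, String](props)
        try consumer.commitSync(toCommit.asJava)
        finally consumer.close()
      }
    }
  }
}

Register it before starting the query, e.g. spark.streams.addListener(new OffsetCommitListener). Keep the warning above in mind: this tracking group merely mirrors Spark's progress and plays no part in Spark's own fault tolerance.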
Here are some links you may find helpful:
Code Example for Listener
Discussion on SO around offset management
General description on the StreamingQueryListener