How to set group.id for consumer group in kafka data source in Structured Streaming?

I want to use Spark Structured Streaming to read from a secure Kafka cluster. This means that I need to force a specific group.id. However, as stated in the documentation, this is not possible. Still, the Databricks documentation https://docs.azuredatabricks.net/spark/latest/structured-streaming/kafka.html#using-ssl says that it is possible. Does this only apply to Azure Databricks clusters?

Also, looking at the documentation on the master branch of the apache/spark repo, https://github.com/apache/spark/blob/master/docs/structured-streaming-kafka-integration.md, it appears that this functionality is intended to be added in a later Spark release. Do you know of any plans for a stable release that will allow setting the consumer group.id?

If not, are there any workarounds for Spark 2.4.0 that would allow setting a specific consumer group.id?
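To make the intent concrete, here is a minimal sketch of the kind of configuration I would like to use (the broker address, security settings, topic name, and group name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SecureKafkaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("secure-kafka-read")
      .getOrCreate()

    // Read from a secured Kafka cluster while forcing a specific consumer group,
    // so that the cluster's ACLs authorize the read.
    // On Spark 2.4.0 this fails: the Kafka source rejects the "kafka.group.id"
    // option and always generates its own group.id.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")  // placeholder broker
      .option("kafka.security.protocol", "SASL_SSL")     // example security setting
      .option("subscribe", "my-topic")                   // placeholder topic
      .option("kafka.group.id", "my-authorized-group")   // what I need, but unsupported in 2.4.0
      .load()

    df.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```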

Panagiotis Fytas asked Mar 26 '19

People also ask

What is the difference between spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

What is offset in Kafka?

Each partition is itself an ordered stream of data. Every message within a partition is assigned an incremental ID that marks its position in the partition; that ID is called the offset.
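As a small illustration (a sketch using the plain Kafka consumer API; the broker address and topic name are placeholders), every record carries the partition it came from and its offset within that partition:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object OffsetDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
    props.put("group.id", "offset-demo")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("my-topic"))  // placeholder topic

    // Each record reports its partition and its offset: the incremental ID
    // that marks the record's position within that partition.
    consumer.poll(Duration.ofSeconds(5)).asScala.foreach { record =>
      println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
    }
    consumer.close()
  }
}
```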

What is the difference between Kafka and spark streaming?

Kafka analyses the events as they unfold. As a result, it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, which divides incoming streams into small batches for processing.


1 Answer

Currently (v2.4.0) it is not possible.

You can check the following lines in the Apache Spark project:

https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81 - generates the group.id

https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L534 - sets it in the properties that are used to create the KafkaConsumer

In the master branch you can find a modification that enables setting a prefix or a particular group.id:

https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L83 - generates the group.id based on a group prefix (the groupIdPrefix option)

https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L543 - sets the previously generated group.id if kafka.group.id wasn't passed in the properties
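As a sketch of how those options are meant to be used on a Spark build that includes this change (i.e. not 2.4.0; the option names come from the linked code and the master docs, and the broker, topic, and group values below are placeholders):

```scala
// e.g. in spark-shell or a notebook, where `spark` is the active SparkSession
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9093")  // placeholder broker
  .option("subscribe", "my-topic")                   // placeholder topic
  // Either keep Spark's generated group ids but control their prefix
  // (useful for prefix-based ACLs)...
  .option("groupIdPrefix", "my-spark-app")
  // ...or pin an exact consumer group. Note that Spark still tracks offsets
  // itself, and concurrently running queries with the same group.id can
  // interfere with each other:
  // .option("kafka.group.id", "my-authorized-group")
  .load()
```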

Bartosz Wardziński answered Sep 24 '22