Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.


Limit Kafka batches size when using Spark Streaming

1 Answer

I think your problem can be solved by Spark Streaming Backpressure.

Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is false, so I suppose Spark will take as much as it can.

From the Apache Spark configuration documentation:

spark.streaming.backpressure.enabled:

This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).

Since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.

This is useful when your Spark job (that is, your Spark workers as a whole) can process, say, 10000 messages from Kafka, but the Kafka brokers hand your job 100000 messages.
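As a rough illustration of the two backpressure properties discussed above, the sketch below renders them as spark-submit `--conf` arguments. The value 10000 is borrowed from the example figure in the answer and is not a recommendation; the helper `to_submit_args` and the script name `your_job.py` are hypothetical names used only for this sketch.

```python
# Hypothetical settings for illustration; tune the rate to your own job.
backpressure_conf = {
    # Backpressure is off by default, so it has to be switched on explicitly.
    "spark.streaming.backpressure.enabled": "true",
    # Initial maximum receiving rate (records per second) used for the first batch.
    "spark.streaming.backpressure.initialRate": "10000",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(backpressure_conf)
# Print the command line that would pass these properties to the job.
print(" ".join(["spark-submit"] + submit_args + ["your_job.py"]))
```

The same keys can be set in code on the Spark configuration object instead; passing them on the command line just keeps the tuning outside the application.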

You may also want to look at spark.streaming.kafka.maxRatePerPartition, as well as Jeroen van Wilgenburg's blog, which researches these properties on a real example and offers suggestions.
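To get a feel for what spark.streaming.kafka.maxRatePerPartition means for batch size, here is a small back-of-the-envelope sketch. It assumes the direct Kafka stream, where the property is a per-partition cap in records per second, so one batch is bounded by rate × partitions × batch interval. The function name and the numbers are hypothetical.

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_interval_seconds):
    """Upper bound on the records in one batch when the per-partition cap applies."""
    return max_rate_per_partition * num_partitions * batch_interval_seconds


# Hypothetical job: 1000 records/s per partition, 10 partitions, 5-second batches.
cap = max_records_per_batch(1000, 10, 5)
print(cap)  # 50000
```

Working backwards from the number of records a batch can comfortably handle gives a starting value for the per-partition rate.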

144

answered Oct 25 '22 04:10

VladoDemcak


Tags:

apache-kafka

apache-spark

kafka-consumer-api

spark-streaming

Samy Dindane
