 

Limit Kafka batches size when using Spark Streaming

Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?

I am asking because the first batch I get has hundreds of millions of records, and it takes ages to process and checkpoint them.

asked Oct 11 '16 by Samy Dindane


People also ask

What is batch size in Spark streaming?

The minimum batch size Spark Streaming can use is 500 milliseconds; it has proven to be a good minimum size for many applications.
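To make that concrete, here is a minimal sketch of where the batch interval is fixed: it is passed when the StreamingContext is created, and every micro-batch then covers that much wall-clock time of input. The app name and the 1-second interval below are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval is chosen once, when the StreamingContext is built.
// "batch-interval-demo" and the 1-second interval are illustrative only.
val conf = new SparkConf().setAppName("batch-interval-demo")
val ssc  = new StreamingContext(conf, Seconds(1)) // each micro-batch covers 1 second of input
```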

Does Spark streaming support batch operations?

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.

What is the primary difference between Kafka streams and Spark streaming?

Kafka analyses the events as they unfold. As a result, it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, which divides incoming streams into small batches for processing.


1 Answer

I think your problem can be solved by Spark Streaming's backpressure mechanism.

Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.

By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will consume as much as it can.

From the Apache Spark Kafka integration configuration:

spark.streaming.backpressure.enabled:

This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
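For reference, here is a minimal sketch of turning this on via SparkConf. The property itself is real; the app name and surrounding setup are placeholders for illustration.

```scala
import org.apache.spark.SparkConf

// Enable backpressure so Spark adapts the ingestion rate to the scheduling
// delay and processing time observed on previous batches.
val conf = new SparkConf()
  .setAppName("kafka-stream")                            // placeholder name
  .set("spark.streaming.backpressure.enabled", "true")
```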

And since you want to control the first batch, or more specifically the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate.

spark.streaming.backpressure.initialRate:

This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.

This is useful when your Spark job (and, by extension, your Spark workers) can process, say, 10,000 messages from Kafka, but the Kafka brokers hand your job 100,000 messages.

You may also be interested in spark.streaming.kafka.maxRatePerPartition, as well as some research and suggestions on these properties, based on a real example, by Jeroen van Wilgenburg on his blog.
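Putting the pieces together, here is a sketch of how these properties could be combined in one SparkConf. The numeric values are illustrative, not recommendations; tune them to your own processing throughput.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-stream")                                  // placeholder
  // let Spark adapt the rate once the first batches have been measured
  .set("spark.streaming.backpressure.enabled", "true")
  // cap the very first batch (records/sec across receivers) before the
  // backpressure estimator has any feedback to work with
  .set("spark.streaming.backpressure.initialRate", "10000")
  // hard per-Kafka-partition ceiling (records/sec) for the direct stream
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")
```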

answered Oct 25 '22 by VladoDemcak