We have some historical data queued up on our Kafka topics, and we don't want to process all of it in a single batch, since that is harder to do (and if the batch fails, it has to start again from scratch!). Also, knowing how to control the batch size would be quite helpful when tuning jobs.
When using DStreams, the way to control the batch size as exactly as possible is described in Limit Kafka batches size when using Spark Streaming. The same approach, i.e. setting maxRatePerPartition and then tuning batchDuration, is extremely cumbersome but does work with DStreams; it doesn't work at all with Structured Streaming.
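For context, here is a minimal sketch of that cumbersome DStream approach; the rate, batch interval, broker address, topic, and group id are illustrative assumptions, not values from my actual job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("dstream-rate-limit") // placeholder app name
  // At most 1000 records/second/partition (illustrative value).
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// With a 5-second batch interval, each batch is capped at roughly
// 1000 records/s * 5 s * <number of partitions> records.
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // assumption: broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "demo-group"               // assumption: consumer group
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()
```

The cumbersome part is that the actual records-per-batch is the product of the rate cap, the batch interval, and the partition count, so any one of the three changing forces re-tuning the others.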
Ideally I'd like to know of a config like maxBatchSize and minBatchSize, where I could simply set the number of records I'd like.
The config option maxOffsetsPerTrigger does what you want:
Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
Note that if you have a checkpoint directory with start and end offsets, then the application will process the offsets in the directory for the first batch, thus ignoring this config. (The next batch will respect it).
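A minimal sketch of setting this on the Kafka source; the broker address, topic name, checkpoint path, and the cap of 20000 are placeholder assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-rate-limit") // placeholder app name
  .getOrCreate()

// Cap each micro-batch at 20000 offsets in total; Spark splits this
// proportionally across the topic's partitions by their volume.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumption: broker address
  .option("subscribe", "events")                       // assumption: topic name
  .option("startingOffsets", "earliest")               // drain the queued history
  .option("maxOffsetsPerTrigger", 20000L)
  .load()

// Any sink works; console plus a checkpoint dir shown for illustration.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints")    // assumption: path
  .start()

query.awaitTermination()
```

Note that the limit is on offsets, not records per se, so with compacted topics a batch may contain fewer records than the configured cap.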