 

Restarting Spark Structured Streaming Job consumes Millions of Kafka messages and dies

We have a Spark Structured Streaming application running on Spark 2.3.3.

Basically, it opens a Kafka Stream:

  kafka_stream = spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", "mykafka:9092") \
      .option("subscribe", "mytopic") \
      .load()

The kafka topic has 2 partitions. After that, there are some basic filtering operations, some Python UDFs and an explode() on a column, like:

   stream = apply_operations(kafka_stream)

where apply_operations does all the work on the data (a hypothetical sketch of such a function is included further below). In the end, we would like to write the stream to a sink, i.e.:

   stream.writeStream \
       .format("our.java.sink.Class") \
       .option("some-option", "value") \
       .trigger(processingTime='15 seconds') \
       .start()

To keep this stream operation running forever, we call at the end:

   spark.streams.awaitAnyTermination()
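For reference, apply_operations itself is not shown in the question; the following is only a hypothetical sketch (the column names and UDF logic are invented for illustration) of a function combining the kind of filtering, Python UDFs and explode() mentioned above:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # Hypothetical UDF; the real logic lives in the actual job.
    parse_tags = F.udf(lambda v: (v or "").split(","), ArrayType(StringType()))

    def apply_operations(kafka_stream):
        return kafka_stream \
            .selectExpr("CAST(value AS STRING) AS value") \
            .filter(F.col("value").isNotNull()) \
            .withColumn("tag", F.explode(parse_tags(F.col("value"))))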

So far, so good: everything runs for days. But due to a network problem, the job was down for a few days, and there are now millions of messages waiting in the Kafka topic to be caught up on.

When we restart the streaming job using spark-submit, the first batch is too large and takes ages to complete. We thought there might be a parameter to limit the size of the first batch, but we did not find anything that helped.

We tried:

  • spark.streaming.backpressure.enabled=true together with spark.streaming.backpressure.initialRate=2000, spark.streaming.kafka.maxRatePerPartition=1000 and spark.streaming.receiver.maxrate=2000 (how such settings are typically passed is sketched right after this list)

  • setting spark.streaming.backpressure.pid.minrate to a lower value did not have an effect either

  • setting option("maxOffsetsPerTrigger", 10000) did not have an effect either
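For reference, settings like those in the first bullet are typically passed either via spark-submit --conf flags or on the SparkSession builder; the exact submit command is not part of the question, so the script name below is a placeholder and the keys are copied from the list above:

    spark-submit \
        --conf spark.streaming.backpressure.enabled=true \
        --conf spark.streaming.backpressure.initialRate=2000 \
        --conf spark.streaming.kafka.maxRatePerPartition=1000 \
        --conf spark.streaming.receiver.maxrate=2000 \
        our_streaming_job.py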

Now, after we restart the pipeline, sooner or later the whole Spark job crashes again. We cannot simply increase the memory or the number of cores for the job.

Is there anything we missed for controlling the number of events processed in one streaming batch?

Regenschein asked Apr 02 '19
People also ask

Is there any difference between Spark Streaming and Spark structured Streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, all the APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

What is the primary difference between Kafka streams and Spark Streaming?

Kafka analyses the events as they unfold. As a result, it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, which divides incoming streams into small batches for processing.

How does Spark Streaming process the data from Kafka?

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and jobs launched by Spark Streaming then process the data.
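As a concrete illustration of this receiver-based approach (this is the old DStream API; it needs the spark-streaming-kafka-0-8 artifact on the classpath, and the ZooKeeper address, group id and topic name below are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="receiver-based-kafka-example")
    ssc = StreamingContext(sc, batchDuration=15)

    # Receiver-based stream: data is pulled via the Kafka high-level consumer
    # and stored on the executors before Spark Streaming jobs process it.
    kafka_dstream = KafkaUtils.createStream(
        ssc, "zookeeper-host:2181", "my-consumer-group", {"mytopic": 1})

    kafka_dstream.pprint()
    ssc.start()
    ssc.awaitTermination()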

How does Spark structured Streaming work?

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark's structured APIs, and run them in a streaming fashion.
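A small sketch of that idea, with the same transformation expressed once in batch mode and once as a stream (the input path /data/events and the column name "type" are made up for illustration):

    # Batch version
    batch_df = spark.read.format("json").load("/data/events")
    batch_counts = batch_df.groupBy("type").count()
    batch_counts.show()

    # Streaming version: identical transformation, only the source and sink differ.
    stream_df = spark.readStream.format("json") \
        .schema(batch_df.schema) \
        .load("/data/events")
    stream_counts = stream_df.groupBy("type").count()
    stream_counts.writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()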

Does Spark Streaming need Kafka?

Spark Streaming is an API that can be connected with a variety of sources including Kafka to deliver high scalability, throughput, fault-tolerance, and other benefits for a high-functioning stream processing mechanism.

How to stream from Spark to Kafka?

Spark Streaming with Kafka example, in five steps:

  1. Run the Kafka producer shell. First, produce some JSON data to the Kafka topic "json_topic"; the Kafka distribution comes with a Kafka producer shell, run ...
  2. Streaming with Kafka.
  3. Spark Streaming write to console.
  4. Spark Streaming write to Kafka topic.
  5. Run the Kafka consumer shell.
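For step 4 in particular, writing a stream back to a Kafka topic with Structured Streaming looks roughly like this (broker address, topic name and checkpoint path are placeholders; the Kafka sink expects a string or binary "value" column, so all columns are serialized to JSON here):

    # df is any streaming DataFrame produced earlier in the job.
    df.selectExpr("to_json(struct(*)) AS value") \
        .writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "mykafka:9092") \
        .option("topic", "json_topic") \
        .option("checkpointLocation", "/tmp/checkpoints/kafka-sink") \
        .start()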

What is startingoffsets earliest in Spark Streaming?

Spark Streaming uses readStream() on SparkSession to load a streaming Dataset from Kafka. The option startingOffsets=earliest is used to read all data available in Kafka at the start of the query. We may not use this option that often, since the default value for startingOffsets is latest, which reads only new data that has not been processed yet.
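In code, that option goes on the reader, e.g. (broker and topic names as used elsewhere on this page):

    df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "mykafka:9092") \
        .option("subscribe", "mytopic") \
        .option("startingOffsets", "earliest") \
        .load()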

What is Apache Spark Structured Streaming and how does it work?

Apache Spark Structured Streaming is a part of the Spark Dataset API. This is an improvement over the DStream-based Spark Streaming, which used the older RDD-based API instead. It aims to provide low-latency and fault-tolerant exactly-once processing. That’s quite an ambitious goal, but with a few caveats, they’ve managed it.

What is the difference between Apache Spark and Apache Kafka?

Kafka has its own stream library and is best for transforming Kafka topic-to-topic whereas Spark streaming can be integrated with almost any type of system. For more detail, you can refer to this blog.


1 Answer

You wrote in the comments that you are using spark-streaming-kafka-0-8_2.11, and that API version is not able to handle maxOffsetsPerTrigger (or, as far as I know, any other mechanism for reducing the number of consumed messages), since this was only implemented for the newer API spark-streaming-kafka-0-10_2.11. According to the documentation, this newer API also works with your Kafka version 0.10.2.2.
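To make that concrete, here is a sketch under the assumption that the job is switched to the newer connector; for the Structured Streaming readStream API used in the question this usually means the spark-sql-kafka-0-10 artifact (the coordinates below assume Spark 2.3.3 with Scala 2.11, adjust to your build). With that source in place, maxOffsetsPerTrigger caps the number of offsets read per micro-batch, which keeps the first batch after a long outage bounded:

    # Submit with the newer Kafka source on the classpath, e.g.:
    #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.3 job.py

    kafka_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "mykafka:9092") \
        .option("subscribe", "mytopic") \
        .option("maxOffsetsPerTrigger", 10000) \
        .load()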

cronoik answered Oct 25 '22