 

How does Spark Structured Streaming handle backpressure?

I'm analyzing the backpressure feature in Spark Structured Streaming. Does anyone know the details? Is it possible to tune how incoming records are processed from code? Thanks

asked Jul 02 '17 by Aniello Guarino


People also ask

What is Spark streaming backpressure?

Backpressure refers to the situation where a system receives data at a higher rate than it can process during a temporary load spike. A sudden spike in traffic can cause bottlenecks in downstream dependencies, which slow down the stream processing.

What is the use of backpressure configuration in Spark streaming?

In summary, enabling backpressure is an important technique for making your Spark Streaming application production-ready. It dynamically sets the message ingestion rate based on previous batch performance, keeping your Spark Streaming application stable and efficient without the pitfalls of a statically capped max rate.
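For the legacy DStream-based Spark Streaming API, that dynamic rate control is enabled through configuration. A minimal sketch in Scala: the property names spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate are real Spark settings, while the app name and batch interval are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // DStream-based Spark Streaming: enable rate-based backpressure so the
    // ingestion rate is tuned automatically from previous batch performance.
    val conf = new SparkConf()
      .setAppName("BackpressureDemo") // illustrative app name
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional: rate used for the first batch, before any feedback exists.
      .set("spark.streaming.backpressure.initialRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(5)) // illustrative batch interval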

How does Spark streaming work internally?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
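A minimal DStream word count illustrates this micro-batch model; the socket host/port and the batch interval below are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10s micro-batches

    // Each 10-second slice of the socket stream becomes an RDD that the
    // regular Spark engine processes like any other batch job.
    val lines  = ssc.socketTextStream("localhost", 9999) // placeholder host/port
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()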

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.
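For contrast with the DStream version above, here is the same word count as a Structured Streaming query, which treats the stream as an unbounded DataFrame planned by Catalyst; the socket host/port are again placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // The stream is an unbounded DataFrame; the query below is planned by
    // the Catalyst optimizer like any other Spark SQL query.
    val lines = spark.readStream
      .format("socket")               // placeholder source for illustration
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")         // emit the full running aggregation
      .format("console")
      .start()
      .awaitTermination()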


2 Answers

If you mean dynamically changing the size of each internal batch in Structured Streaming, then no. There are no receiver-based sources in Structured Streaming, so that's not necessary. From another point of view, Structured Streaming cannot do real backpressure, because, for example, Spark cannot tell other applications to slow down the speed at which they push data into Kafka.

Generally, Structured Streaming will try to process data as fast as possible by default. Each source provides options to control the processing rate, such as maxFilesPerTrigger in the File source and maxOffsetsPerTrigger in the Kafka source; there is a sketch of both after the links. Read the following links for more details:

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
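A rough sketch of those per-source rate limits in Scala; the schema, file path, broker address, and topic name are all hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructType}

    val spark = SparkSession.builder.appName("RateLimitDemo").getOrCreate()

    // File source: pick up at most 10 new files per micro-batch.
    val fileSchema = new StructType().add("event", StringType) // hypothetical schema
    val fileStream = spark.readStream
      .format("json")
      .option("maxFilesPerTrigger", "10")
      .schema(fileSchema)
      .load("/data/incoming")                             // hypothetical path

    // Kafka source: consume at most 10,000 offsets (records) per trigger.
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .option("maxOffsetsPerTrigger", "10000")
      .load()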

answered Sep 25 '22 by zsxwing


Handling backpressure is needed only in push-based mechanisms. Kafka consumers are pull-based: Spark will pull the next batch of records only when the current batch has finished processing and saving. If processing and saving are delayed in Spark, it won't pull a new batch of records, so there is no need for backpressure handling.

maxOffsetsPerTrigger can change the number of records processed per Spark batch, and spark.streaming.backpressure.enabled (a setting for the older DStream API) changes the rate of receiving, but neither is the same as backpressure, where you go and tell the source itself to slow down.
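A sketch of this pull-based behavior, combining maxOffsetsPerTrigger with a processing-time trigger; the broker address, topic name, and cap values are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("PullBasedDemo").getOrCreate()

    // Pull-based consumption: Spark asks Kafka for the next slice of offsets
    // only after the previous micro-batch has been processed and saved, so a
    // slow query defers the next pull instead of being flooded with records.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .option("maxOffsetsPerTrigger", "5000")           // per-batch cap
      .load()

    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("30 seconds"))    // earliest next pull
      .start()
      .awaitTermination()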

answered Sep 23 '22 by spats