I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused: what is the difference between the Processing Time, Total Delay, and Processing Delay of the last batch?
I have looked at the Spark Streaming guide, which mentions the Processing Time as a key metric for figuring out whether the system is falling behind, but other sources, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming and explains what each of them means.
I would appreciate it if someone could outline what each of these three metrics means, or point me to resources that can help me understand them.
Scheduling Delay is the time spent from when the collection of streaming jobs for a batch was submitted to when the first streaming job (out of possibly many streaming jobs in the collection) was started.
Let's break down each metric. For that, let's define a basic streaming application which reads a batch from some arbitrary source at a 4-second interval and computes the classic word count:
inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...")
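For completeness, a self-contained version of that application might look like the sketch below. The socket source, host, port, and output prefix are placeholders chosen for illustration; the only detail that matters for the metrics discussion is the 4-second batch interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // 4-second batch interval, matching the example above.
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(4))

    // Placeholder source standing in for "some arbitrary source".
    val inputDStream = ssc.socketTextStream("localhost", 9999)

    inputDStream.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFiles("hdfs:///tmp/wordcounts") // placeholder output prefix

    ssc.start()
    ssc.awaitTermination()
  }
}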
Processing Time: The time it takes to compute a given batch for all of its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it assumes as a prerequisite that the job has already been submitted.
Scheduling Delay: The time taken by the Spark Streaming scheduler to submit the jobs of the batch. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, thus making the scheduling delay of the next batch 4 seconds long.
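To see how this compounds over consecutive batches, here is a back-of-the-envelope simulation in plain Scala (not Spark code, runnable in the Scala REPL). It assumes the simplified model above: batches arrive every 4 seconds, jobs run one at a time, and a batch's jobs can only be submitted once the previous batch has finished. The durations are made up for illustration.

// Toy model of the delay bookkeeping:
//   schedulingDelay(n) = max(0, totalDelay(n-1) - interval)
//   totalDelay(n)      = schedulingDelay(n) + processingTime(n)
val interval = 4
val processingTimes = Seq(8, 8, 1) // seconds each batch took to compute

processingTimes.foldLeft(0) { (prevTotalDelay, processingTime) =>
  val schedulingDelay = math.max(0, prevTotalDelay - interval)
  val totalDelay = schedulingDelay + processingTime
  println(s"processing=${processingTime}s scheduling=${schedulingDelay}s total=${totalDelay}s")
  totalDelay
}

The second batch reproduces the numbers from the example: a 4-second scheduling delay, and the 12-second total delay discussed next.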
Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch took another 8 seconds to compute, the total delay is now 8 + 4 = 12 seconds long.
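Incidentally, these same numbers are available programmatically, which is likely where the "Processing Delay" naming from the question comes from: each completed batch's BatchInfo exposes schedulingDelay, processingDelay, and totalDelay (all in milliseconds), and processingDelay corresponds to what the streaming UI shows as Processing Time. A minimal listener sketch:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Prints the three metrics for every completed batch (values are in ms).
class BatchMetricsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)}ms " +
      s"totalDelay=${info.totalDelay.getOrElse(-1L)}ms")
  }
}

// Register it on the StreamingContext before ssc.start():
// ssc.addStreamingListener(new BatchMetricsListener)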
A live example from a working Streaming application:
(Screenshot of the Streaming tab's completed-batches table omitted.)

We see that:

- One batch took 11 seconds to process, which puts the next batch 11 - 4 = 7 seconds behind, i.e. a scheduling delay of 7 seconds.
- That next batch had a processing time of 1 second, making its total delay 7 + 1 = 8 seconds.