
Spark Streaming Processing Time vs Total Delay vs Processing Delay

I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused: what is the difference between the Processing Time, Total Delay, and Processing Delay of the last batch?

I have looked at the Spark Streaming guide, which mentions Processing Time as a key metric for figuring out whether the system is falling behind, but other sources, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming along with an explanation of what each of them means.

I would appreciate it if someone could outline what each of these three metrics means, or point me to any resources that can help me understand them.

asked Nov 02 '16 by Zak


People also ask

What is scheduling delay in Spark Streaming?

Scheduling Delay is the time spent from when the collection of streaming jobs for a batch was submitted to when the first streaming job (out of possibly many streaming jobs in the collection) was started.

Is Spark batch processing or stream processing?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
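As a minimal sketch of that batching model (the 4-second interval, local master, and socket source are assumptions for illustration, not anything prescribed by the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group incoming data into 4-second batches; each RDD in the
// resulting DStream holds one batch of received lines.
val conf = new SparkConf().setAppName("BatchingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(4))

val lines = ssc.socketTextStream("localhost", 9999)
lines.print() // print the first elements of each batch

ssc.start()
ssc.awaitTermination()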

Is Spark structured Streaming real-time?

Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs.
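For contrast with the DStream API discussed in this question, here is a minimal Structured Streaming sketch. The socket source and console sink are assumptions for illustration; a real exactly-once pipeline needs a replayable source and checkpointing:

import org.apache.spark.sql.SparkSession

// Structured Streaming treats the input as an unbounded table and
// processes newly arrived rows incrementally.
val spark = SparkSession.builder
  .appName("StructuredStreamingSketch")
  .master("local[2]")
  .getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()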

What is stream processing in Spark?

Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. This processed data can be pushed out to file systems, databases, and live dashboards.
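A sketch of the sink side, assuming a wordCounts DStream[(String, Int)] (for example, the result of the word count in the answer below); saveToStore is a hypothetical stand-in for a real database write, not a Spark API:

// Hypothetical helper, purely for illustration; replace with a real
// database client call.
def saveToStore(word: String, count: Int): Unit =
  println(s"$word -> $count")

// foreachRDD exposes each batch as an RDD, which can then be pushed
// to any external system, one partition at a time.
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { case (word, count) => saveToStore(word, count) }
  }
}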


1 Answer

Let's break down each metric. For that, let's define a basic streaming application which reads a batch from some arbitrary source at a 4-second interval, and computes the classic word count:

// Split each line into words, pair each word with a count of 1,
// sum the counts per word, and write each batch's result to HDFS.
inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...")
  • Processing Time: The time it takes to compute a given batch, across all of its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it presupposes that the job has already been submitted; the clock starts when processing starts, not when the batch was submitted. Note that in Spark's BatchInfo API this same value is exposed as processingDelay (processing end time minus processing start time), so Processing Time and Processing Delay refer to the same measurement.

  • Scheduling Delay: The time the Spark Streaming scheduler takes to submit the jobs of the batch, i.e. how long the batch waited before processing started. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, which makes the scheduling delay of the next batch 4 seconds long.

  • Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch takes another 8 seconds to compute, the total delay is now 8 + 4 = 12 seconds long. All three values can also be read programmatically, as in the listener sketch below.
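A minimal sketch of reading these metrics in code, using Spark's StreamingListener API from org.apache.spark.streaming.scheduler, which exposes exactly these values per completed batch via BatchInfo:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the three metrics for every completed batch. The values come
// straight from BatchInfo: processingDelay (= processing time),
// schedulingDelay, and totalDelay, all in milliseconds.
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"Processing time:  ${info.processingDelay.getOrElse(-1L)} ms")
    println(s"Scheduling delay: ${info.schedulingDelay.getOrElse(-1L)} ms")
    println(s"Total delay:      ${info.totalDelay.getOrElse(-1L)} ms")
  }
}

// Register it on the StreamingContext (ssc) before calling ssc.start():
// ssc.addStreamingListener(new DelayLogger)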

A live example from a working Streaming application:

[Screenshot: batch statistics from the Spark Streaming UI]

We see that:

  • The bottom job took 11 seconds to process. So the next batch's scheduling delay is 11 - 4 = 7 seconds.
  • If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1), 7 + 1 = 8.
answered Oct 12 '22 by Yuval Itzchakov