Lately I've been tuning the performance of some large, shuffle-heavy jobs. Looking at the Spark UI, I noticed an option called "Shuffle Read Blocked Time" under the Additional Metrics section.
This "Shuffle Read Blocked Time" seems to account for upwards of 50% of the task duration for a large swath of tasks.
While I can intuit some possibilities for what this means, I can't find any documentation that explains what it actually represents. Needless to say, I also haven't been able to find any resources on mitigation strategies.
Can anyone provide some insight into how I might reduce Shuffle Read Blocked Time?
Spark shuffles the mapped data across partitions; sometimes it also stores the shuffled data on disk for reuse when it needs to recalculate. Finally, it runs reduce tasks on each partition based on key.
The chunk of shuffle data written by a shuffle map task for a given shuffle reduce task is called a shuffle block. Further, each shuffle map task informs the driver about the shuffle data it has written.
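To make that concrete, here is a minimal sketch (the input path is hypothetical): reduceByKey is a wide operation, so its map side writes shuffle blocks and its reduce side fetches them, and persisting the result lets Spark reuse the shuffled data instead of recomputing it:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleSketch").getOrCreate()
val sc = spark.sparkContext

val counts = sc.textFile("hdfs:///logs/app.log")  // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)  // wide dependency: map side writes shuffle blocks, reduce side fetches them

// Persisting the result lets Spark reuse the shuffled data on subsequent
// actions instead of recomputing the whole shuffle.
counts.persist()
counts.count()
```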
Task Deserialization Time: Spark by default uses the Java serializer for object serialization. To enable the Kryo serializer, which outperforms the default Java serializer in both time and space, set the spark.serializer parameter to org.apache.spark.serializer.KryoSerializer.
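A minimal sketch of wiring that up at session creation (the app name and registered class are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KryoExample")  // hypothetical app name
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering frequently serialized classes lets Kryo write compact IDs
  // instead of full class names (optional but recommended).
  .config("spark.kryo.classesToRegister", "com.example.MyRecord")  // hypothetical class
  .getOrCreate()
```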
"Shuffle Read Blocked Time" is the time that tasks spent blocked waiting for shuffle data to be read from remote machines. The exact metric it feeds from is shuffleReadMetrics.fetchWaitTime.
It's hard to give input on a mitigation strategy without actually knowing what data you're trying to read or what sort of remote machines you're reading from. Generally, though, the usual levers are: shuffle less data (filter or pre-aggregate before the wide operation), even out partition skew so no single reducer waits on an oversized block, and check that the network between executors isn't the bottleneck. A sketch of the first lever follows.
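As a rough illustration (the table paths and column names below are made up), pre-aggregating before a join cuts the volume of rows that has to cross the network, and spark.sql.shuffle.partitions controls reduce-side parallelism:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ShuffleTuning").getOrCreate()
import spark.implicits._

// Hypothetical inputs, purely for illustration.
val events = spark.read.parquet("/data/events")
val users  = spark.read.parquet("/data/users")

// Aggregate before the join instead of after, so far fewer rows
// are shuffled across the network.
val perUser = events.groupBy($"userId").agg(count("*").as("eventCount"))
val joined  = perUser.join(users, Seq("userId"))

// Match shuffle parallelism to the data volume and cluster size.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```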
As to the metrics, this documentation should shed some light on them: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-webui-StagePage.html
Lastly, I also found it hard to find information on Shuffle Read Blocked Time, but if you put the phrase in quotes, "Shuffle Read Blocked Time", in a Google search, you'll find some decent results.