I need to run a Spark program that processes a huge amount of data. I am trying to optimize it by working through the Spark UI and reducing the shuffle.
There are a couple of metrics mentioned, shuffle read and shuffle write. I can roughly tell the difference from their names, but I would like to understand what exactly they mean and which of the two, shuffle read or shuffle write, hurts performance.
I have searched the internet but could not find solid, in-depth details about them, so I wanted to see if anyone can explain them here.
Input: Bytes read from storage in this stage.
Output: Bytes written in storage in this stage.
Shuffle read: Total shuffle bytes and records read, includes both data read locally and data read from remote executors.
Shuffle write: Bytes and records written to disk in order to be read by a shuffle in a future stage.
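To make those four metrics concrete, here is a minimal Spark (Scala) sketch; the paths and column names are made up for illustration. The stage that scans the files reports Input, the groupBy forces a shuffle (Shuffle Write on the map side, Shuffle Read on the reduce side), and the final save reports Output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ShuffleMetricsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-metrics-demo")
      .getOrCreate()

    // Scanning the files shows up as "Input" in this stage.
    val events = spark.read.parquet("/data/events")

    // groupBy is a wide transformation: the scan stage writes shuffle files
    // ("Shuffle Write"), and the aggregation stage fetches them ("Shuffle Read").
    val totals = events
      .groupBy("userId")
      .agg(sum("amount").as("total"))

    // Writing the result shows up as "Output".
    totals.write.parquet("/data/event_totals")

    spark.stop()
  }
}
```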
The Spark SQL shuffle is a mechanism for redistributing or re-partitioning data so that it is grouped differently across partitions. Depending on your data size, you may need to reduce or increase the number of partitions of the RDD/DataFrame (for Spark SQL, via the spark.sql.shuffle.partitions setting).
Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound by local disk capacity.
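A short sketch of the two usual knobs for that re-partitioning, with illustrative values and a hypothetical input path: spark.sql.shuffle.partitions sets how many partitions Spark SQL produces after a shuffle, while repartition()/coalesce() change the partitioning of a specific DataFrame.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partition-tuning")
  // Default is 200; a large shuffle may want more partitions, a small one fewer.
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()

val df = spark.read.parquet("/data/events")      // hypothetical input path

// Explicit shuffle into 400 partitions, hashed by key.
val reshaped = df.repartition(400, df("userId"))

// Shrink the partition count without a full shuffle.
val compacted = reshaped.coalesce(50)
```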
From the UI tooltip
Shuffle Read
Total shuffle bytes and records read (includes both data read locally and data read from remote executors)
Shuffle Write
Bytes and records written to disk in order to be read by a shuffle in a future stage
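If it helps, the same numbers the UI aggregates per stage can also be pulled programmatically. A rough sketch (class name is my own) of a SparkListener that logs shuffle read/write bytes per task:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class ShuffleLoggingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      // Local + remote bytes fetched by this task.
      val read  = metrics.shuffleReadMetrics.totalBytesRead
      // Bytes written to shuffle files for a future stage to read.
      val write = metrics.shuffleWriteMetrics.bytesWritten
      println(s"stage ${taskEnd.stageId}: shuffle read=$read B, shuffle write=$write B")
    }
  }
}

// Register it on an existing SparkContext, e.g.:
// sc.addSparkListener(new ShuffleLoggingListener())
```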
I've recently begun working with Spark. I have been looking for answers to the same sort of questions.
When the data from one stage is shuffled to the next stage over the network, the executor(s) that process the next stage pull the data from the first stage's process through TCP. I noticed the shuffle "write" and "read" metrics for each stage are displayed in the Spark UI for a particular job. A stage may also have an "input" size (e.g. input from HDFS or a Hive table scan).
I noticed that the shuffle write size from one stage that fed into another stage did not match that stage's shuffle read size. If I remember correctly, there are reducer-type operations (map-side combines, such as the one reduceByKey performs) that can be applied to the shuffle data before it is transferred to the next stage/executor as an optimization. Maybe this contributes to the difference in size and therefore the relevance of reporting both values.
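For what it's worth, here is a small RDD sketch (made-up data) of the map-side combine I mean: reduceByKey pre-aggregates within each partition before the shuffle, so the map stage's Shuffle Write is usually much smaller than for the equivalent groupByKey job.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-side-combine").getOrCreate()
val sc = spark.sparkContext

// One million records, only 100 distinct keys.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L))

// No map-side combine: every (key, 1L) record goes into the shuffle files.
val grouped = pairs.groupByKey().mapValues(_.sum)

// Map-side combine: each partition pre-aggregates to at most 100 records
// before anything is shuffled, shrinking both Shuffle Write and Shuffle Read.
val reduced = pairs.reduceByKey(_ + _)

grouped.count()
reduced.count()
```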