Why do join and group by affect the amount of data shuffled in Spark?

I'm using Spark and I'm seeing that when a query has a lot of join and group by operations, Spark needs to do a lot of shuffle operations. I looked for information on why this happens but couldn't find anything concrete. Can you help me understand this?

asked Feb 06 '23 by jUsr

2 Answers

A Spark shuffle is simply data being moved around the cluster. Any transformation that requires data which is not already present locally in a partition will trigger a shuffle. Looking at a join: matching rows from both DataFrames must end up in the same partition for the operation to complete, so Spark shuffles both inputs to bring rows with the same key together. The same thing happens with group by key, where all rows with the same key need to end up in the same partition, so the shuffle moves them there. As you can see, this is not good performance-wise, so you should try to avoid it where possible.
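You can see this for yourself in the physical plan, where every shuffle shows up as an `Exchange` node. A minimal sketch (the `users`/`orders` data is made up, and broadcast joins are disabled so the shuffle is actually visible for such tiny tables):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Disable broadcast joins so the shuffle appears in the plan;
// for tables this small Spark would otherwise broadcast one side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val users  = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val orders = Seq((1, 10.0), (1, 5.0), (2, 7.5)).toDF("id", "amount")

// Rows with the same id from both sides must meet on the same
// partition, so Spark inserts an Exchange (shuffle) on each input.
val joined = users.join(orders, "id")

// The plan prints "Exchange hashpartitioning(id, ...)" wherever
// data has to move across the cluster.
joined.explain()
```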

answered May 24 '23 by z-star


In simpler words:

  1. The data is spread across the cluster.

    • Spark runs on top of a distributed file system, like HDFS. Since it's a distributed file system, the data is spread across the cluster.
    • An RDD is an abstraction for a distributed dataset, thus data that constitutes an RDD is spread across the cluster.
  2. Sometimes, data has to be moved around.

    • Whenever you come across an operation where rows with the same key need to end up together, be worried.
    • Since the data is spread across the cluster, rows with the same key need to travel across the cluster (i.e., be shuffled) to be brought together. This is the case, for example, when you want to combine two RDDs by key (join), or when you want to collect all the values for a key and perform an operation on them (groupByKey).
  3. The amount of data that needs to travel may not always be a lot. For your specific cases:

    • For joins, if the RDDs are co-partitioned, or if we have made sure that rows with the same keys already sit together, there won't be any shuffle during the join! (See the first sketch after this list.)

    • You can reduce the amount of data shuffled in a groupByKey operation by switching to reduceByKey. However, this is no silver bullet; there are cases where you may want to stay with groupByKey. (See the second sketch after this list.)
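First, a minimal sketch of the co-partitioning point (the data and partition count are made up, and `sc` is assumed to be an existing SparkContext, as in spark-shell). Because both RDDs were partitioned with the same HashPartitioner, rows with equal keys already live on the same partition, and the join adds no further shuffle:

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)

// partitionBy shuffles once, up front; the cached result remembers
// its partitioner for later operations.
val left = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(partitioner)
  .cache()
val right = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))
  .partitionBy(partitioner)
  .cache()

// Both sides share the same partitioner, so join can match keys
// partition-by-partition without moving any more data around.
val joined = left.join(right)
joined.collect().foreach(println)
```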
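Second, a sketch of the groupByKey vs. reduceByKey point, doing the same per-key sum both ways (again assuming a spark-shell `sc`; the pairs are made up). reduceByKey pre-aggregates inside each partition, so only one partial sum per key crosses the network, while groupByKey ships every individual value:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// groupByKey shuffles every (key, value) pair, then sums per key.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so each partition
// emits at most one partial sum per key before the shuffle.
val viaReduce = pairs.reduceByKey(_ + _)

// Both produce the same result; only the shuffle volume differs.
viaReduce.collect().foreach(println)
```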

answered May 24 '23 by axiom