I'm using Spark and I'm seeing that when a query has many join and group-by operations, Spark needs to do a lot of shuffling. I've been looking for information on why this happens but can't find anything concrete. Can you help me understand this?
A Spark shuffle is simply the movement of data around the cluster. Any transformation that requires data not present locally in a partition triggers one. Looking at a join: rows with the same join key from both datasets must end up in the same partition before they can be matched, so Spark repartitions both sides by the join key. The same thing happens with a group by key, where all rows with the same key need to end up in the same partition, so the shuffle moves them there. As you can see, this is expensive (network transfer, disk I/O, and serialization), so you should try to avoid it where possible.
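You can make the shuffles visible yourself by looking at the physical plan. Here is a minimal sketch in Scala (the dataset contents and names are made up, and broadcast joins are disabled just so the join-side shuffle shows up even on tiny data):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-demo")
      .master("local[*]")
      // Force a shuffle join; otherwise Spark would broadcast these tiny tables.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "books"), (2, "games"), (1, "music")).toDF("userId", "category")
    val users  = Seq((1, "alice"), (2, "bob")).toDF("userId", "name")

    // Both the join and the aggregation repartition rows by key.
    val counts = orders.join(users, "userId").groupBy("name").count()

    // Every "Exchange hashpartitioning(...)" node in this plan is a shuffle.
    counts.explain()

    spark.stop()
  }
}
```

Each Exchange operator in the printed plan marks a stage boundary where data crosses the network, which is exactly the cost you're seeing in your queries.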
In simpler words:
The data is spread across the cluster in partitions.
Sometimes, data has to be moved around. That happens when rows from two datasets have to be matched on a key (join), or when you want to collect all the values for a key together and perform an operation on them (groupByKey). The amount of data that needs to travel may not always be a lot. For your specific cases:
- For joins, if the RDDs are co-partitioned, or if we have otherwise made sure that rows with the same keys sit together, there won't be any shuffle during the join (see the first sketch after this list).
- You can reduce the amount of data shuffled in a groupByKey operation by switching to reduceByKey (see the second sketch). However, this is no silver bullet; there are cases where you may want to stay with groupByKey.
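To illustrate the co-partitioning point from the first bullet, here is a hedged sketch against the RDD API, where sc is assumed to be an existing SparkContext and the data is made up:

```scala
import org.apache.spark.HashPartitioner

// Two pair RDDs with made-up data.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val right = sc.parallelize(Seq((1, "x"), (2, "y")))

val partitioner = new HashPartitioner(4)

// partitionBy shuffles once, up front; persist() keeps that layout
// so later operations can reuse it.
val leftPart  = left.partitionBy(partitioner).persist()
val rightPart = right.partitionBy(partitioner).persist()

// Both RDDs share the same partitioner, so matching keys are already
// colocated and the join itself needs no further shuffle.
val joined = leftPart.join(rightPart)
```

Paying for one up-front shuffle like this is worthwhile when the partitioned RDDs are joined or aggregated repeatedly.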
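And for the second bullet, a small sketch of groupByKey versus reduceByKey, again assuming an existing SparkContext sc:

```scala
// A pair RDD with made-up data.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey ships every (key, value) pair across the network and
// only then sums on the receiving side.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey first combines values within each partition (a map-side
// combine), so only one partial sum per key per partition is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)
```

The two produce the same result here, but reduceByKey moves far less data when keys repeat within partitions. groupByKey remains the right choice when you genuinely need all the values for a key at once, e.g. to sort them.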