
Why does Spark Planner prefer sort merge join over shuffled hash join?

Why does Spark Planner in Spark 2.3 prefer a sort merge join over a shuffled hash join? In other words, why is the spark.sql.join.preferSortMergeJoin configuration property internal and turned on by default? What's wrong with a shuffled hash join? Is this specific to Spark because it does computations in a distributed fashion, or is it something more inherent in the join algorithms themselves?

You can find the property used in the JoinSelection execution planning strategy here and here; the relevant case looks like this:

case ... if !conf.preferSortMergeJoin && ... =>
  Seq(joins.ShuffledHashJoinExec(...))
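
For context, here is a minimal sketch (not part of the original snippet) of how one could flip that internal flag on a local SparkSession and read it back; only the property name comes from Spark itself, the app name and master are illustrative:

import org.apache.spark.sql.SparkSession

// Build a local session with the internal flag switched off.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("prefer-smj-demo")
  .config("spark.sql.join.preferSortMergeJoin", "false")  // defaults to true
  .getOrCreate()

// The internal property should still be readable through the runtime config.
println(spark.conf.get("spark.sql.join.preferSortMergeJoin"))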
asked Apr 25 '18 by Jacek Laskowski




1 Answer

In order to answer your question

What's wrong with a shuffled hash join?

I'll provide some context first.


According to SPARK-11675, Shuffled Hash Join was removed in Spark 1.6, and the reason was:

... I think we should just standardize on sort merge join for large joins for now, and create better implementations of hash joins if needed in the future

and it was reintroduced in Spark 2.0, according to SPARK-13977, because:

ShuffledHashJoin is still useful when:

1) any partition of the build side could fit in memory

2) the build side is much smaller than stream side, the building hash table on smaller side should be faster than sorting the bigger side.

It's also worth mentioning the PR for SPARK-13977, which points out that Shuffled Hash Join had been removed in favor of Sort Merge Join, which is faster and more robust.
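
Those two conditions map roughly onto the size checks JoinSelection applies before it will pick ShuffledHashJoinExec. The following is a paraphrased, simplified sketch (the parameter names are mine, not Spark's exact signatures):

// 1) a single partition of the build side should fit in memory:
//    its estimated size must stay below autoBroadcastJoinThreshold * numShufflePartitions
def canBuildLocalHashMap(buildSizeInBytes: BigInt,
                         autoBroadcastJoinThreshold: Long,
                         numShufflePartitions: Int): Boolean =
  buildSizeInBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions

// 2) the build side should be much smaller (roughly 3x) than the stream side,
//    so hashing the small side beats sorting the big one
def muchSmaller(buildSizeInBytes: BigInt, streamSizeInBytes: BigInt): Boolean =
  buildSizeInBytes * 3 <= streamSizeInBytes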


I'm not sure how much faster Sort Merge Join is compared to Shuffled Hash Join, but it's definitely more robust: Shuffled Hash Join requires the hashed table to fit in memory, whereas Sort Merge Join can spill to disk.
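
If you want to see which physical join the planner actually picks, a quick experiment (illustrative only, not from the original answer) is to keep the small side from being broadcast and inspect the plan; whether ShuffledHashJoin shows up still depends on the size estimates passing the checks sketched above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-plan-demo").getOrCreate()

// Arbitrary illustrative datasets: one large, one small.
val big   = spark.range(0L, 1000000L).toDF("id")
val small = spark.range(0L, 1000L).toDF("id")

// Lower the broadcast threshold so the small side is not broadcast,
// and turn off the preference for sort merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1024L)
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

// Look for SortMergeJoin vs ShuffledHashJoin in the printed physical plan.
big.join(small, "id").explain()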

answered Oct 20 '22 by Panos