Why does the Spark Planner in Spark 2.3 prefer a sort merge join over a shuffled hash join? In other words, why is the spark.sql.join.preferSortMergeJoin
configuration property internal and turned on by default? What's wrong with a shuffled hash join? Is this specific to Spark because it does its computations in a distributed fashion, or is it something more inherent in the join algorithm itself?
You can find the property used in the JoinSelection
execution planning strategy (here and here) in code that looks like this:
case ... if !conf.preferSortMergeJoin && ... =>
Seq(joins.ShuffledHashJoinExec(...))
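You can also observe the effect of the property empirically by flipping it at runtime and inspecting the physical plan with explain(). Below is a minimal sketch, not an authoritative recipe: it assumes local mode, default shuffle and broadcast thresholds, and dataset sizes large enough that broadcast is not applicable; which operator actually appears still depends on the planner's size estimates.

```scala
import org.apache.spark.sql.SparkSession

object PreferSortMergeJoinDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("prefer-sort-merge-join-demo")
      .master("local[*]")
      .getOrCreate()

    // Two datasets whose estimated sizes are (assumed to be) above the default
    // broadcast threshold, so the planner has to choose a shuffle-based join.
    val big   = spark.range(0L, 40000000L).toDF("id")
    val small = spark.range(0L, 4000000L).toDF("id")

    // Default: spark.sql.join.preferSortMergeJoin is true, so expect
    // SortMergeJoin in the physical plan.
    big.join(small, "id").explain()

    // The property is internal; flipping it here is purely for experimentation.
    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

    // ShuffledHashJoinExec may now be chosen, but only if the planner's size
    // estimates say the build side fits in memory per partition and is much
    // smaller than the stream side -- check the printed plan rather than assume.
    big.join(small, "id").explain()

    spark.stop()
  }
}
```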
In order to answer your question
What's wrong with a shuffled hash join?
I'll provide some context first.
According to SPARK-11675, Shuffled Hash Join was removed in Spark 1.6, and the reason given was:
... I think we should just standardize on sort merge join for large joins for now, and create better implementations of hash joins if needed in the future
and it was reintroduced in Spark 2.0, according to SPARK-13977, because:
ShuffledHashJoin is still useful when:
1) any partition of the build side could fit in memory
2) the build side is much smaller than stream side, the building hash table on smaller side should be faster than sorting the bigger side.
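To make those two conditions concrete: the planner compares the build side's estimated size against configured thresholds and against the stream side. The sketch below is a rough paraphrase of those heuristics, not the actual Spark source; the helper names and constants are recalled from memory of JoinSelection and worth double-checking there.

```scala
// A rough paraphrase of the size heuristics JoinSelection applies; the helper
// names and the factor of three are recalled from the Spark 2.x source
// (org.apache.spark.sql.execution.SparkStrategies) and should be verified there.
object JoinSelectionHeuristics {

  // "Any partition of the build side could fit in memory": the build side's
  // estimated size must be below autoBroadcastJoinThreshold multiplied by the
  // number of shuffle partitions, i.e. each partition stays under the
  // broadcast threshold on average.
  def canBuildLocalHashMap(buildSideBytes: BigInt,
                           autoBroadcastJoinThreshold: Long,
                           numShufflePartitions: Int): Boolean =
    buildSideBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions

  // "The build side is much smaller than the stream side": roughly a factor
  // of three in the 2.x source.
  def muchSmaller(buildSideBytes: BigInt, streamSideBytes: BigInt): Boolean =
    buildSideBytes * 3 <= streamSideBytes
}
```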
It's worth mentioning the PR for SPARK-13977, which points out that Shuffled Hash Join had been removed in favor of Sort Merge Join because the latter is faster and more robust.
I'm not sure how much faster Sort Merge Join is compared to Shuffled Hash Join, but it is definitely more robust: Shuffled Hash Join requires the hashed table to fit in memory, whereas Sort Merge Join can spill to disk.