I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.
So, I know there is a query hint to force a broadcast join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?
I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer a more granular solution, i.e. one that does not disable broadcast joins globally.
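For reference, the workaround looks roughly like this (a minimal sketch; the table and column names are placeholders, not my real schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-example").getOrCreate()

    // A threshold of 0 tells the planner that no relation is small enough
    // to broadcast, so it stops picking BroadcastHashJoin on its own.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")

    val large = spark.table("large_table")   // ~14 million rows
    val other = spark.table("other_table")

    // Previously planned as a BroadcastHashJoin (and ran out of memory);
    // with the threshold at 0 this becomes a sort-merge join.
    large.join(other, Seq("id")).explain()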
If a broadcast hash join can be used (by the broadcast hint or by the total size of a relation), Spark SQL chooses it over other joins (see the JoinSelection execution planning strategy).
With that said, to avoid it, don't force a broadcast hash join (using the broadcast standard function on the left or right join side), and disable the preference for a broadcast hash join by setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value.
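A minimal sketch of the effect on the physical plan (assumes a SparkSession named spark; the table names are placeholders):

    // Turn off size-based broadcasting (no broadcast hint is used here).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val q = spark.table("big").join(spark.table("huge"), "id")
    q.explain()
    // The plan now shows SortMergeJoin instead of BroadcastHashJoin.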
Along with setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value (as per Jacek's answer), check the state of spark.sql.join.preferSortMergeJoin:
Hint for sort-merge join: set the above conf to true.
Hint for shuffled hash join: set the above conf to false.
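A sketch of the toggle (assumes a SparkSession named spark; table names are placeholders). One caveat: in Spark 2.x the planner only considers a shuffled hash join when it estimates that one side is small enough to build a per-partition hash map from, and that estimate is itself derived from spark.sql.autoBroadcastJoinThreshold, so setting the threshold to 0 can rule out the shuffled hash join as well:

    // Prefer sort-merge join (the default):
    spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

    // Or prefer a shuffled hash join (subject to the caveat above):
    // spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

    spark.table("a").join(spark.table("b"), "key").explain()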