Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?

I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast-hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then lead to a timeout.

So, I know there is a query hint to force a broadcast-join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?

I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer another solution which is more granular, i.e. not disable the broadcast join globally.

like image 301
Raphael Roth Avatar asked Jan 08 '18 06:01

Raphael Roth


2 Answers

If a broadcast hash join can be used (by the broadcast hint or by total size of a relation), Spark SQL chooses it over other joins (see JoinSelection execution planning strategy).

With that said, don't force a broadcast hash join (using broadcast standard function on the left or right join side) or disable the preference for a broadcast hash join using spark.sql.autoBroadcastJoinThreshold to be 0 or negative.

like image 80
Jacek Laskowski Avatar answered Sep 29 '22 13:09

Jacek Laskowski


Along with setting spark.sql.autoBroadcastJoinThreshold to 0 or to a negative value as per Jacek's response, check the state of 'spark.sql.join.preferSortMergeJoin'

Hint for Sort Merge join : Set the above conf to true

Hint for Shuffled Hash join: Set the above conf to false.

like image 27
V Jaiswal Avatar answered Sep 29 '22 12:09

V Jaiswal