I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory and Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.
So, I know there is a query hint to force a broadcast join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?
I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer a more granular solution, i.e. one that does not disable broadcast joins globally.
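For reference, the workaround looks roughly like this (a minimal sketch; the table and column names are placeholders, not my real schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-example").getOrCreate()

    // A threshold of 0 tells the planner that no relation is small enough
    // to broadcast, so it stops picking BroadcastHashJoin on its own.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")

    val large = spark.table("large_table")   // ~14 million rows
    val other = spark.table("other_table")

    // Previously planned as a BroadcastHashJoin (and ran out of memory);
    // with the threshold at 0 this becomes a sort-merge join.
    large.join(other, Seq("id")).explain()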
If a broadcast hash join can be used (by the broadcast hint or by the total size of a relation), Spark SQL chooses it over other joins (see the JoinSelection execution planning strategy).
With that said, to avoid it, don't force a broadcast hash join (using the broadcast standard function on the left or right join side), and disable the preference for a broadcast hash join by setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value.
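A minimal sketch of the effect on the physical plan (assumes a SparkSession named spark; the table names are placeholders):

    // Turn off size-based broadcasting (no broadcast hint is used here).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val q = spark.table("big").join(spark.table("huge"), "id")
    q.explain()
    // The plan now shows SortMergeJoin instead of BroadcastHashJoin.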
Along with setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value (as per Jacek's answer), check the state of spark.sql.join.preferSortMergeJoin:
Hint for sort-merge join: set the above conf to true.
Hint for shuffled hash join: set the above conf to false.
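A sketch of the toggle (assumes a SparkSession named spark; table names are placeholders). One caveat: in Spark 2.x the planner only considers a shuffled hash join when it estimates that one side is small enough to build a per-partition hash map from, and that estimate is itself derived from spark.sql.autoBroadcastJoinThreshold, so setting the threshold to 0 can rule out the shuffled hash join as well:

    // Prefer sort-merge join (the default):
    spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

    // Or prefer a shuffled hash join (subject to the caveat above):
    // spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

    spark.table("a").join(spark.table("b"), "key").explain()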