
What is the Difference between Broadcast hash join and Broadcast Nested loop join in Spark?

Tags:

apache-spark

What is the difference between a broadcast hash join and a broadcast nested loop join in Spark? In which scenarios does Spark pick each one, and which one is faster?

asked Mar 03 '23 23:03 by rupesh kumar



1 Answer

You can get some information from the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L111

Broadcast hash join (BHJ): Only supported for equi-joins, while the join keys do not need to be sortable. Supported for all join types except full outer joins. BHJ usually performs faster than the other join algorithms when the broadcast side is small. However, broadcasting tables is a network-intensive operation and it could cause OOM or perform badly in some cases, especially when the build/broadcast side is big.
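To make the equi-join restriction concrete, here is a minimal plain-Python sketch of the idea behind a hash join with a broadcast (build) side. It is not Spark's implementation; the function name `broadcast_hash_join` and the sample rows are made up for illustration. The small side is turned into a hash table once and each row of the large side does a single lookup, which is why only equality conditions can be served this way and why the keys have to be hashable but not sortable.

```python
from collections import defaultdict


def broadcast_hash_join(stream_rows, build_rows, stream_key, build_key):
    """Inner equi-join sketch: hash the small (build) side, probe with the large side."""
    # Build phase: index every row of the small side by its join key.
    table = defaultdict(list)
    for b in build_rows:
        table[build_key(b)].append(b)

    # Probe phase: one hash lookup per row of the large side.
    joined = []
    for s in stream_rows:
        for b in table.get(stream_key(s), []):
            joined.append((s, b))
    return joined


# Hypothetical sample data: (order_id, country_code) and (country_code, name).
orders = [(1, "us"), (2, "de"), (3, "fr"), (4, "us")]
countries = [("us", "United States"), ("de", "Germany")]

hash_result = broadcast_hash_join(
    orders, countries, stream_key=lambda o: o[1], build_key=lambda c: c[0]
)
print(hash_result)
```

The work is roughly one pass over each side, so the cost grows with the sum of the two input sizes. The price is that the whole build side has to fit in memory, which matches the OOM caveat above.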

Broadcast nested loop join (BNLJ): Supports both equi-joins and non-equi-joins. Supports all the join types, but the implementation is optimized for: 1) broadcasting the left side in a right outer join; 2) broadcasting the right side in a left outer, left semi, left anti or existence join; 3) broadcasting either side in an inner-like join. In all other cases Spark has to scan the data multiple times, which can be rather slow.
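As a rough illustration of why a nested loop join can handle non-equi conditions, the following plain-Python sketch compares every row of one side with every row of the broadcast side and keeps the pairs for which an arbitrary predicate holds. It is only a sketch of the general technique, not Spark's code; `broadcast_nested_loop_join` and the sample data are hypothetical.

```python
def broadcast_nested_loop_join(stream_rows, broadcast_rows, condition):
    """Inner join sketch: test an arbitrary condition on every pair of rows."""
    joined = []
    for s in stream_rows:
        # The broadcast side is fully re-scanned for each streamed row.
        for b in broadcast_rows:
            if condition(s, b):
                joined.append((s, b))
    return joined


# Hypothetical sample data: (event_id, value) and (label, lower, upper).
events = [("a", 5), ("b", 15), ("c", 25)]
ranges = [("low", 0, 10), ("mid", 10, 20)]

# A range condition like this cannot be answered by a hash lookup.
loop_result = broadcast_nested_loop_join(
    events, ranges, condition=lambda e, r: r[1] <= e[1] < r[2]
)
print(loop_result)
```

Every pair is evaluated, so the number of comparisons is the product of the two input sizes (here 3 × 2 = 6). That is why, for an equi-join with a small broadcast side, a hash-based join is usually the faster option.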

answered Mar 06 '23 12:03 by Raphael Roth
