
DataFrame join optimization - Broadcast Hash Join

Tags: dataframe, apache-spark, apache-spark-sql, apache-spark-1.4

I am trying to efficiently join two DataFrames, one of which is large and the other somewhat smaller.

Is there a way to avoid all this shuffling? I cannot set autoBroadcastJoinThreshold, because it accepts only an Integer, and the table I am trying to broadcast is slightly larger than the maximum Integer value in bytes.

Is there a way to force a broadcast, ignoring this variable?


asked Sep 07 '15 09:09

NNamed

2 Answers

Broadcast hash joins (similar to a map-side join or map-side combine in MapReduce):

In Spark SQL you can see the type of join being performed by inspecting queryExecution.executedPlan. As with core Spark, if one of the tables is much smaller than the other, you may want a broadcast hash join. You can hint to Spark SQL that a given DataFrame should be broadcast for the join by calling the broadcast function on it before joining.

Example: largedataframe.join(broadcast(smalldataframe), "key")

In data warehouse (DWH) terms, largedataframe is like a fact table and smalldataframe is like a dimension table.

This is described in my favourite book, High Performance Spark (HPS); see the figure for a better understanding: https://i.stack.imgur.com/S4c1x.png
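To make the idea concrete, here is a minimal plain-Scala sketch with no Spark involved. It is not Spark's actual implementation; the object name BroadcastHashJoinIdea, the function broadcastHashJoin and the sample data are hypothetical and exist only for illustration. The small side is turned into a hash table once, and every partition of the large side probes that table locally, so no rows of the large side have to move between partitions.

```scala
object BroadcastHashJoinIdea {
  // Inner-joins each partition of the large side against a full in-memory
  // copy of the small side. Rows of the large side never leave their
  // partition, which is why this kind of join needs no shuffle.
  def broadcastHashJoin[K, A, B](
      largePartitions: Seq[Seq[(K, A)]],
      small: Seq[(K, B)]
  ): Seq[Seq[(K, (A, B))]] = {
    // Build the hash table once; in Spark a copy of the small side
    // would be shipped to every executor instead.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    largePartitions.map { partition =>
      for {
        (k, a) <- partition
        b <- lookup.getOrElse(k, Seq.empty)
      } yield (k, (a, b))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: two partitions of a "fact" table and a small "dimension" table.
    val fact = Seq(
      Seq(1 -> "order-1", 2 -> "order-2"),
      Seq(1 -> "order-3", 3 -> "order-4")
    )
    val dim = Seq(1 -> "alice", 2 -> "bob")
    broadcastHashJoin(fact, dim).foreach(println)
  }
}
```

The cost is that the whole small side must fit in memory on every worker, which is why Spark only broadcasts automatically below a size threshold.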

Note: the broadcast used above is the function from import org.apache.spark.sql.functions.broadcast, not the broadcast method on SparkContext.

Spark also uses spark.sql.autoBroadcastJoinThreshold automatically to decide whether a table should be broadcast.

Tip: see the DataFrame.explain() method

def explain(): Unit
Prints the physical plan to the console for debugging purposes.

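Is there a way to force broadcast ignoring this variable?

`sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold = -1")`

Note that setting the threshold to -1 disables automatic broadcast joins; it does not force one. The explicit broadcast(...) hint is what forces the broadcast, and it is honored whatever the threshold is.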
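Putting the hint, the threshold setting and explain() together, a minimal sketch might look like the following. It assumes Spark 2.0 or later, where SparkSession replaces the sqlContext used in this answer; the object name BroadcastJoinSketch, the helper joinWithHint and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Joins a large DataFrame to a small one on "key", explicitly hinting
  // that the small side should be broadcast.
  def joinWithHint(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), "key")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Turn off automatic broadcasting, so only the explicit hint can trigger it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical sample data standing in for the fact and dimension tables.
    val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "value")
    val small = Seq((1, "x"), (2, "y")).toDF("key", "label")

    val joined = joinWithHint(large, small)
    joined.explain() // check the printed physical plan for a BroadcastHashJoin node
    joined.show()

    spark.stop()
  }
}
```

If the plan shows a sort-merge join instead, the hint was not applied to the DataFrame that actually takes part in the join.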

NOTE:

A related aside for Hive (not Spark): something similar can be achieved with the Hive MAPJOIN hint, as below.

    SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;

    hive> set hive.auto.convert.join=true;
    hive> set hive.auto.convert.join.noconditionaltask.size=20971520;
    hive> set hive.auto.convert.join.noconditionaltask=true;
    hive> set hive.auto.convert.join.use.nonstaged=true;
    hive> set hive.mapjoin.smalltable.filesize=30000000; -- default is 25 MB; raised here to 30 MB

Further reading: please refer to my article on broadcast hash join (BHJ), shuffle hash join (SHJ) and sort-merge join (SMJ).


answered Sep 25 '22 23:09

Ram Ghadiyaram

You can hint that a DataFrame should be broadcast by using left.join(broadcast(right), ...)

answered Sep 24 '22 23:09

Sebastian Piu
