We use a broadcast hash join in Spark when one dataframe is small enough to fit into memory, that is, when the size of the small dataframe is below spark.sql.autoBroadcastJoinThreshold.
I have a few questions around this.
What is the life cycle of the small dataframe that we hint as broadcast? For how long will it remain in memory? How can we control that?
For example, say I have joined a big dataframe with a small dataframe twice using a broadcast hash join. When the first join runs, it will broadcast the small dataframe to the worker nodes and perform the join while avoiding a shuffle of the big dataframe's data.
My question is: for how long will the executor keep a copy of the broadcast dataframe? Will it remain in memory until the session ends? Or will it get cleared once we have taken an action? Can we control or clear it? Or am I just thinking in the wrong direction?
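For reference, here is a minimal sketch of the scenario I mean (the dataframe names and sizes are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: big_df is large, small_df easily fits in executor memory.
big_df = spark.range(1_000_000).withColumnRenamed("id", "key")
small_df = spark.range(100).selectExpr("id as key", "id * 2 as value")

# First join: small_df is broadcast to every executor, no shuffle of big_df.
joined_once = big_df.join(broadcast(small_df), "key")

# Second join against the same small dataframe: is the broadcast copy on the
# executors reused here, or does it get re-broadcast? And when is it freed?
joined_twice = joined_once.join(broadcast(small_df), "key")
joined_twice.explain()  # both joins appear as BroadcastHashJoin in the plan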
Broadcast hash join - a broadcast join copies the small dataset to every worker node, which leads to a highly efficient and super-fast join. When we are joining two datasets and one of them is much smaller than the other (e.g. when the small dataset can fit into memory), we should use a broadcast hash join.
Broadcast nested loop join - in a nested loop join, each row of the first dataset is iterated over every row of the other dataset, which can degrade join performance. It is used when neither a broadcast hash join, a shuffled hash join, nor a sort merge join can execute the join statement, for example when there are no equi-join keys, provided one side qualifies as broadcastable according to the data statistics (size) or a broadcast hint.
The right side of a left outer, left semi, left anti, or existence join will be broadcast. Either side can be broadcast in an inner-like join. Once the dataset is broadcast, every record from one dataset is compared against every record from the other dataset in a nested loop.
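As a minimal sketch (the dataframes here are hypothetical), a non-equi join has no keys to hash or sort on, so hinting the small side forces a broadcast nested loop join:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: match each event timestamp to a small set of ranges.
events = spark.range(1_000).withColumnRenamed("id", "ts")
ranges = spark.range(10).selectExpr("id * 100 as lo", "id * 100 + 50 as hi")

# No equality predicate, so a hash or sort-merge join is impossible; with the
# hint, the small side is broadcast and joined in a nested loop.
joined = events.join(broadcast(ranges),
                     (events.ts >= ranges.lo) & (events.ts <= ranges.hi))
joined.explain()  # physical plan shows BroadcastNestedLoopJoin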
The answer to your question, at least in Spark 2.4.0, is that the dataframe will remain in memory on the driver process until the SparkContext is completed, that is, until your application ends.
Broadcast joins are in fact implemented using broadcast variables, but when using the DataFrame API you do not get access to the underlying broadcast variable. Spark itself does not destroy this variable after it uses it internally, so it just stays around.
Specifically, if you look at the code of BroadcastExchangeExec (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala), you can see that it creates a private variable relationFuture which holds the Broadcast variable. This private variable is only used in this class. There is no way for you as a user to get access to it to call destroy on it, and nowhere in the current implementation does Spark call it for you.
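You can see this exchange in the physical plan, but no handle to it is ever returned to Python; a quick sketch (the dataframes are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
big_df = spark.range(1_000_000).withColumnRenamed("id", "key")
small_df = spark.range(100).withColumnRenamed("id", "key")

# The plan contains a BroadcastExchange node (built by BroadcastExchangeExec),
# but the underlying Broadcast variable is never exposed to user code.
big_df.join(broadcast(small_df), "key").explain()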
The idea here is to create the broadcast variable before the join so that you can easily control it. Without that you can't control these variables - Spark does it for you.
Example:
from pyspark.sql.functions import broadcast

# Mark sdf2 for broadcast explicitly, then use the marked dataframe in the join.
sdf2_bd = broadcast(sdf2)
sdf1.join(sdf2_bd, sdf1.id == sdf2_bd.id)
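Note that broadcast() here is only a hint to the planner; it does not return a broadcast variable you can destroy. If you need full control over the lifecycle, you can create the broadcast variable yourself via SparkContext.broadcast and ship a plain lookup structure instead - a minimal sketch (the dataframe and lookup contents are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
sdf1 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# Hypothetical small lookup table, broadcast by hand so we own its lifecycle.
lookup = {1: "a", 2: "b"}
bv = spark.sparkContext.broadcast(lookup)

@udf(StringType())
def lookup_label(key):
    return bv.value.get(key)

sdf1.withColumn("label", lookup_label("id")).show()

# Free the copies on the executors (it can be re-broadcast if used again)...
bv.unpersist()
# ...or drop it from the driver and executors for good.
bv.destroy()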
The same rules apply to all broadcast variables, whether created automatically in joins or by hand.