Below is the sample code that I am running. When this Spark job runs, the DataFrame join is executed as a SortMergeJoin instead of a BroadcastJoin.
def joinedDf(sqlContext: SQLContext,
             txnTable: DataFrame,
             countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
  txnTable.as("df1").join(
    countriesDfBroadcast.value.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries"),
    $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
}
joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")
The broadcast join is not happening even when I specify a broadcast() hint in the join statement. The optimizer is hash-partitioning the DataFrame, and that is causing data skew.
Has anyone seen this behavior?
I am running this on YARN with Spark 1.6, using a HiveContext as the SQLContext. The job runs on 200 executors; txnTable is 240 GB and countriesDf is 5 MB.
Syntax for a PySpark broadcast join, b1.join(broadcast(b2), on, how): b1 is the first DataFrame used for the join; b2 is the second, broadcasted DataFrame; join is the join operation; broadcast is the function that marks the DataFrame for broadcasting.
Broadcast variables are used the same way for RDDs, DataFrames, and Datasets. When you run a Spark job that defines and uses broadcast variables, Spark breaks the job into stages separated by distributed shuffles, and actions are executed within each stage.
A broadcast join in Spark is preferred when joining one small DataFrame with a large one. The requirement is that the small DataFrame fits comfortably in memory, so it can be shipped to every executor and joined against the large DataFrame without shuffling it, which boosts join performance.
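To make the mechanism concrete, here is a minimal, Spark-free Scala sketch of the broadcast-hash-join idea (the case classes and field names are illustrative assumptions, not the poster's actual schema): build an in-memory map from the small side, then probe it while streaming over the large side.

```scala
// Illustrative stand-ins for the two sides of the join.
case class Country(cntryId: Int, name: String)
case class Txn(txnId: Long, userCntryId: Int, amount: Double)

def broadcastHashJoin(txns: Seq[Txn], countries: Seq[Country]): Seq[(Txn, Country)] = {
  // "Broadcast" step: the small side fits in memory, so every worker
  // could hold this map locally.
  val byId: Map[Int, Country] = countries.map(c => c.cntryId -> c).toMap
  // Probe step: stream the large side; it is never shuffled or sorted.
  txns.flatMap(t => byId.get(t.userCntryId).map(c => (t, c)))
}
```

This is why broadcast joins avoid the skew the poster is seeing: the large table's partitioning is left untouched.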
The maximum size for a broadcast table is 8 GB. Spark also maintains an internal threshold on table size below which it automatically applies broadcast joins; the threshold can be configured using spark.sql.autoBroadcastJoinThreshold.
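For example, a hedged spark-defaults.conf sketch of raising that threshold (the 50 MB value is an arbitrary choice for illustration; countriesDf at ~5 MB would already sit under the usual 10 MB default, so in practice the plan should also be checked with explain()):

```properties
# spark-defaults.conf: allow tables up to ~50 MB to be auto-broadcast
spark.sql.autoBroadcastJoinThreshold  52428800
```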
Both the way you broadcast the DataFrame and the way you access it are incorrect. Standard broadcast variables cannot be used to handle distributed data structures. If you want to perform a broadcast join on a DataFrame, you should use the broadcast function, which marks the given DataFrame for broadcasting:
import org.apache.spark.sql.functions.broadcast

val countriesDf: DataFrame = ???
val tmp: DataFrame = broadcast(
  countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")
)

txnTable.as("df1").join(
  broadcast(tmp), $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
Internally, Spark will collect tmp without converting it from the internal representation, and broadcast it afterwards.
join arguments are evaluated eagerly. Even if it were possible to use SparkContext.broadcast with a distributed data structure, the broadcast value would be evaluated locally before join is called. That's why your function works at all but doesn't perform a broadcast join.
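The eager-evaluation point can be seen with plain Scala, no Spark required: by-value arguments are fully evaluated before the callee runs, which is why countriesDfBroadcast.value is materialized on the driver before join ever executes. The names below are illustrative stand-ins, not Spark APIs.

```scala
import scala.collection.mutable.Buffer

// Records the order in which things happen.
val events = Buffer[String]()

// Stand-in for countriesDfBroadcast.value: evaluated first, locally.
def materializeValue(): Int = { events += "argument evaluated"; 42 }

// Stand-in for join: by the time it runs, its argument is already a plain local value.
def joinLike(x: Int): Int = { events += "join called"; x }

joinLike(materializeValue())
```

The recorded order shows the argument is evaluated before the "join" is entered, mirroring why the poster's function runs but never reaches Spark's broadcast-join machinery.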