
What is the maximum size for a broadcast object in Spark?

When using Dataframe broadcast function or the SparkContext broadcast functions, what is the maximum object size that can be dispatched to all executors?

Asked Dec 08 '16 by Kirk Broadhurst

People also ask

What is a broadcast variable in Spark?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
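A minimal sketch in Scala (the app name and local master are illustrative, assuming spark-shell-style code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BroadcastDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// The lookup map is shipped to each executor once, not once per task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(lookup.value("a")) // tasks and the driver read it via .value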

What is benefit of performing broadcasting in Spark?

They can be used to give every node a copy of a large input dataset without wasting time on repeated network transfer. Spark distributes broadcast variables using a variety of efficient broadcast algorithms, which reduces communication cost.

Can we broadcast RDD in Spark?

In PySpark, for both Resilient Distributed Datasets (RDDs) and DataFrames, broadcast variables are read-only shared variables that are cached and made available on all nodes in the cluster so that tasks can access and use them.
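A hedged sketch of tasks reading a broadcast value inside an RDD transformation (reusing the sc from the sketch above):

// Each task reads codes.value from the executor-local cache instead of
// capturing the map in its own closure
val codes = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))
val expanded = sc.parallelize(Seq("US", "IN", "US"))
  .map(c => codes.value.getOrElse(c, "unknown"))
expanded.collect().foreach(println)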


2 Answers

broadcast function:

The default is 10 MB, but we have used values up to 300 MB; the threshold is controlled by spark.sql.autoBroadcastJoinThreshold.
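For example, a sketch of raising the threshold at runtime (the value is in bytes, and setting it to -1 disables automatic broadcast joins; spark is the session from the sketch above):

// Raise the auto-broadcast threshold to ~300 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 300L * 1024 * 1024)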

AFAIK, it all depends on the memory available, so there is no definite answer. What I would say is that the broadcast side should be smaller than the large DataFrame, and you can estimate the size of a large or small DataFrame like below...

import org.apache.spark.util.SizeEstimator

// Rough estimate of the object's in-memory size in bytes
println(SizeEstimator.estimate(yourLargeOrSmallDataFrameHere))

Based on this estimate, you can pass a broadcast hint to the framework.
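A sketch, where largeDF and smallDF are placeholder DataFrames:

import org.apache.spark.sql.functions.broadcast

// The hint forces smallDF onto the broadcast side regardless of the threshold
val joined = largeDF.join(broadcast(smallDF), Seq("id"))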

Also have a look at the Scaladoc in sql/execution/SparkStrategies.scala, which says:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the
    [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling
    performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.
  • If there are no joining keys, join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
    • CartesianProduct: for Inner join
    • BroadcastNestedLoopJoin
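To see which of these strategies the planner actually chose, you can print the physical plan (a sketch, reusing the placeholder DataFrames from above):

// Look for BroadcastHashJoin vs. SortMergeJoin in the printed plan
largeDF.join(smallDF, Seq("id")).explain()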

Also have a look at other-configuration-options

SparkContext.broadcast (TorrentBroadcast):

The broadcast shared variable also has a property, spark.broadcast.blockSize (default 4m), which controls the size of the pieces a broadcast is split into. AFAIK, there is no hard limit that I have seen for this either...
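A sketch of setting it at application startup (the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Size of each piece used by TorrentBroadcast; this tunes chunking,
// it is not a cap on the total broadcast size
val conf = new SparkConf().setAppName("BlockSizeDemo").set("spark.broadcast.blockSize", "8m")
val sc = new SparkContext(conf)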

For further information, please see TorrentBroadcast.scala.


EDIT:

However, you can have a look at the 2 GB issue, even though it was not officially declared in the docs (I was not able to find anything of this kind there). Please look at SPARK-6235, which is in the "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.

Answered Sep 19 '22 by Ram Ghadiyaram


As of Spark 2.4, there is an upper limit of 8 GB on the size of a broadcasted table. Source Code

Answered Sep 18 '22 by Vijayant