This is a very simple question: in spark, <code>broadcast</code> can be used to send variables to executors efficiently. How does this work ? More precisely: <ul> <li>when are values sent : as soon as I call <code>broadcast</code>, or when the values are used ?</li> <li>Where exactly is the data sent : to all executors, or only to the ones that will need it ?</li> <li>where is the data stored ? In memory, or on disk ?</li> <li>Is there a difference in how simple variables and broadcast variables are accessed ? What happens under the hood when I call the <code>.value</code> method ?</li> </ul>

<h3>Short answer</h3> <ul> <li>Values are sent the first time they are needed in an executor. Nothing is sent when <code>sc.broadcast(variable)</code> is called.</li> <li>The data is sent only to the nodes that contain an executor that needs it.</li> <li>The data is stored in memory. If not enough memory is available, the disk is used.</li> <li>Yes, there is a big difference between accessing a local variable and a broadcast variable. Broadcast variables have to be downloaded the first time they are accessed.</li> </ul> <h3>Long answer</h3> The answer is in Spark's source, in <code>TorrentBroadcast.scala</code>. <ol> <li> When <code>sc.broadcast</code> is called, a new <code>TorrentBroadcast</code> object is instantiated from <code>BroadcastFactory.scala</code>. The following happens in <code>writeBlocks()</code>, which is called when the TorrentBroadcast object is initialized: <ol> <li>The object is cached unserialized locally using the <code>MEMORY_AND_DISK</code> policy.</li> <li>It is serialized.</li> <li>The serialized version is split into 4Mb blocks, that are compressed[0], and saved locally[1].</li> </ol> </li> <li>When new executors are created, they only have the lightweight <code>TorrentBroadcast</code> object, that only contains the broadcast object's identifier, and its number of blocks.</li> <li> The <code>TorrentBroadcast</code> object has a lazy[2] property that contains its value. When the <code>value</code> method is called, this lazy property is returned. So the first time this value function is called on a task, the following happens: <ol> <li>In a random order, blocks are fetched from the local block manager and uncompressed.</li> <li>If they are not present in the local block manager, <code>getRemoteBytes</code> is called on the block manager to fetch them. Network traffic happens only at that time.</li> <li>If the block wasn't present locally, it is cached using <code>MEMORY_AND_DISK_SER</code>.</li> </ol> </li> </ol> <hr> [0] Compressed with lz4 by default. This can be tuned. [1] The blocks are stored in the local block manager, using <code>MEMORY_AND_DISK_SER</code>, which means that it spills partitions that don't fit in memory to disk. Each block has an unique identifier, computed from the identifier of the broadcast variable, and its offset. The size of blocks can be configured; it is 4Mb by default. [2] A lazy val in scala is a variable whose value is evaluated the first time it is accessed, and then cached. See the documentation.

<ul> <li>as soon as it is broadcasted</li> <li>it is send to all executors using torrent protocol but loaded only when needed</li> <li>once loaded variables are stored deserialized in memory </li> <li> it: <ul> <li>validates that broadcast hasn't been destroyed</li> <li>lazily loads variable from blockManager</li> </ul> </li> </ul>

In spark, how does broadcast work?

2 Answers

Short answer

Values are sent the first time they are needed in an executor. Nothing is sent when sc.broadcast(variable) is called.
The data is sent only to the nodes that contain an executor that needs it.
The data is stored in memory. If not enough memory is available, the disk is used.
Yes, there is a big difference between accessing a local variable and a broadcast variable. Broadcast variables have to be downloaded the first time they are accessed.

Long answer

The answer is in Spark's source, in TorrentBroadcast.scala.

When sc.broadcast is called, a new TorrentBroadcast object is instantiated from BroadcastFactory.scala. The following happens in writeBlocks(), which is called when the TorrentBroadcast object is initialized:
1. The object is cached unserialized locally using the MEMORY_AND_DISK policy.
2. It is serialized.
3. The serialized version is split into 4Mb blocks, that are compressed^[0], and saved locally^[1].
When new executors are created, they only have the lightweight TorrentBroadcast object, that only contains the broadcast object's identifier, and its number of blocks.
The TorrentBroadcast object has a lazy^[2] property that contains its value. When the value method is called, this lazy property is returned. So the first time this value function is called on a task, the following happens:
1. In a random order, blocks are fetched from the local block manager and uncompressed.
2. If they are not present in the local block manager, getRemoteBytes is called on the block manager to fetch them. Network traffic happens only at that time.
3. If the block wasn't present locally, it is cached using MEMORY_AND_DISK_SER.

^[0] Compressed with lz4 by default. This can be tuned.

^[1] The blocks are stored in the local block manager, using MEMORY_AND_DISK_SER, which means that it spills partitions that don't fit in memory to disk. Each block has an unique identifier, computed from the identifier of the broadcast variable, and its offset. The size of blocks can be configured; it is 4Mb by default.

^[2] A lazy val in scala is a variable whose value is evaluated the first time it is accessed, and then cached. See the documentation.

182

answered Nov 17 '22 00:11

lovasoa

as soon as it is broadcasted
it is send to all executors using torrent protocol but loaded only when needed
once loaded variables are stored deserialized in memory
it:
- validates that broadcast hasn't been destroyed
- lazily loads variable from blockManager

answered Nov 17 '22 00:11

fc787bc0

Related questions
                            
                                What is a glom?. How it is different from mapPartitions?
                            
                                Pyspark : forward fill with last observation for a DataFrame
                            
                                Read from a hive table and write back to it using spark sql
                            
                                pyspark parse fixed width text file
                            
                                Error while exploding a struct column in Spark
                            
                                In Spark API, What is the difference between makeRDD functions and parallelize function?
                            
                                Spark DataFrame and renaming multiple columns (Java)
                            
                                How do I order fields of my Row objects in Spark (Python)
                            
                                How to read streaming dataset once and output to multiple sinks?
                            
                                Difference between sc.textFile and spark.read.text in Spark
                            
                                Spark: Repartition strategy after reading text file
                            
                                How does Spark interoperate with CPython
                            
                                Scale(Normalise) a column in SPARK Dataframe - Pyspark
                            
                                Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark
                            
                                Addition of two RDD[mllib.linalg.Vector]'s
                            
                                How to deal with tasks running too long (comparing to others in job) in yarn-client?
                            
                                Spark Streaming get warn "replicated to only 0 peer(s) instead of 1 peers"
                            
                                Should we parallelize a DataFrame like we parallelize a Seq before training
                            
                                Package-private scope in Scala visible from Java
                            
                                SparkContext.addFile vs spark-submit --files

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

In spark, how does broadcast work?

Tags:

apache-spark

hadoop2

bigdata

lovasoa

People also ask

2 Answers

Short answer

Long answer

lovasoa

fc787bc0

Recent Activity

Donate For Us