I am going through Spark Programming guide that says:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Considering the above, what are the use cases of broadcast variables? What problems do broadcast variables solve?
When we create a broadcast variable as below, is the variable reference (here broadcastVar) available on all the nodes in the cluster?
val broadcastVar = sc.broadcast(Array(1, 2, 3))
How long are these variables available in the memory of the nodes?
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
Using broadcast variables can improve performance by reducing the amount of network traffic and data serialization required to execute your Spark application.
Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
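The two kinds of shared variables can be sketched side by side. This is a minimal example, assuming the Spark 2.x API and an already-created SparkContext `sc` (as in a spark-shell session); the lookup table and accumulator name are made up for illustration:

```scala
// Assumes a live SparkContext `sc`, e.g. inside spark-shell.

// Broadcast variable: read-only, cached once on every executor.
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

// Accumulator: tasks only add to it; the driver reads the total.
val misses = sc.longAccumulator("unknown keys")

val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val named = rdd.map { i =>
  lookup.value.getOrElse(i, { misses.add(1); "unknown" })
}
named.collect()       // tasks read lookup.value on the executors
println(misses.value) // the driver reads the accumulated count
```

Note the asymmetry: tasks read `lookup.value` but never write it, while they add to `misses` but never read a meaningful value from it inside a task.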
Broadcast variables are variables which are available in all executors executing the Spark application. These variables are already cached and ready to be used by tasks executing as part of the application.
We can use a broadcast variable to deal with such scenarios. This is useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important. Broadcasting reduces the size of each serialized task and the cost of launching a job on a cluster.
If you have a huge array that is accessed from Spark closures, for example some reference data, that array will be shipped to each Spark node with the closure. For example, if you have a 10-node cluster with 100 partitions (10 partitions per node), the array will be distributed at least 100 times (10 times to each node).
If you use a broadcast variable, it will be distributed once per node using an efficient p2p protocol.
val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)
And some RDD:
val rdd: RDD[Int] = ???
In this case the array will be shipped with the closure each time:
rdd.map(i => array.contains(i))
whereas with the broadcast variable you get a significant performance benefit:
rdd.map(i => broadcasted.value.contains(i))
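As for how long broadcast values stay in executor memory: they remain cached until the application ends or you release them explicitly. A short sketch using the `Broadcast` API (`unpersist` and `destroy` are real methods on `org.apache.spark.broadcast.Broadcast`; the values here are illustrative):

```scala
val broadcasted = sc.broadcast(Array(1, 2, 3))

// ... run jobs that read broadcasted.value ...

// Remove cached copies from the executors; the value is
// re-broadcast lazily if a later task uses it again.
broadcasted.unpersist()

// Remove all state, on executors and the driver alike;
// the variable cannot be used after this call.
broadcasted.destroy()
```

So the practical answer to the lifetime question is: for the duration of the application, unless you call `unpersist()` (temporary removal) or `destroy()` (permanent removal).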