I have a node with 24 cores and 124 GB of RAM in my Spark cluster. If I set spark.executor.memory to 4g and then broadcast a variable that takes 3.5 GB of RAM to store, will the cores collectively hold 24 copies of that variable, or one copy?
I am using PySpark v1.6.2.
Broadcast variables are used to keep a copy of data on all nodes. The variable is cached on each machine rather than shipped to the machines with every task.
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums. This guide shows each of these features in each of Spark's supported languages.
A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
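To make that concrete, here is a minimal PySpark sketch of the broadcast API described above; the data and names are illustrative only:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-example")

# A large read-only structure we want available on every node once,
# instead of serializing it into every task closure.
lookup_table = {i: i * i for i in range(1000)}
bc_lookup = sc.broadcast(lookup_table)

# Worker-side tasks read the cached copy through .value.
result = sc.parallelize(range(10)).map(lambda x: bc_lookup.value[x]).collect()
print(result)  # [0, 1, 4, 9, ...]

bc_lookup.unpersist()  # release the cached copies when no longer needed
sc.stop()
```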
I believe that PySpark doesn't use any form of shared memory to share broadcast variables between the workers.
On Unix-like systems, broadcast variables are loaded in the main function of the worker, which is called only after forking from the daemon, so they are not accessible from the parent process space.
If you want to reduce the footprint of large variables without using an external service, I would recommend using file-backed objects with memory mapping. That way you can, for example, use NumPy arrays efficiently.
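As a rough sketch of that file-backed approach (the path, shapes, and partition counts are assumptions, and the file must live somewhere every node can read, e.g. shared storage):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="mmap-example")

# The driver writes the large array once to a location visible from all nodes.
path = "/shared/big_array.npy"  # hypothetical shared path
np.save(path, np.random.rand(100000, 100))

def process_partition(rows):
    # mmap_mode="r" maps the file lazily; worker processes on the same node
    # end up sharing the physical pages through the OS page cache instead of
    # each holding a full private in-memory copy.
    big = np.load(path, mmap_mode="r")
    for i in rows:
        yield float(big[i].sum())

print(sc.parallelize(range(10), 2).mapPartitions(process_partition).collect())
sc.stop()
```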
In contrast, native (JVM) Spark applications do share broadcast variables between the multiple executor threads within a single executor JVM.