
PySpark: do Python processes on an executor node share broadcast variables in RAM?

I have a node with 24 cores and 124 GB of RAM in my Spark cluster. If I set spark.executor.memory to 4g and then broadcast a variable that takes 3.5 GB to store in RAM, will the cores collectively hold 24 copies of that variable, or one copy?

I am using PySpark v1.6.2.

ThatDataGuy asked Oct 17 '16 09:10





1 Answer

I believe that PySpark doesn't use any form of shared memory to share broadcast variables between the workers.

On Unix-like systems, broadcast variables are loaded in the main function of the worker, which is called only after forking from the daemon, so they are not accessible from the parent process space. Each Python worker process therefore loads its own copy.
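The forking behaviour can be illustrated with a rough sketch. This is not PySpark's actual daemon code: the names `loaded` and `worker_main` are hypothetical stand-ins, and it assumes a Unix-like system where `os.fork` is available. It only shows that data loaded after a fork lives in the child process alone.

```python
import os

# Hypothetical stand-in for the place a worker keeps its broadcast values.
loaded = {}

def worker_main():
    # The value is loaded here, i.e. only after the fork has happened.
    loaded["big"] = list(range(5))
    return len(loaded["big"])

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child ("worker"): load the value and report its size to the parent.
    os.close(r)
    os.write(w, str(worker_main()).encode())
    os._exit(0)

# Parent ("daemon"): read the child's report, then inspect its own memory.
os.close(w)
child_count = int(os.read(r, 16).decode())
os.close(r)
os.waitpid(pid, 0)

# The parent never sees what the child loaded after the fork.
parent_has_copy = "big" in loaded
print(child_count, parent_has_copy)
```

Every forked worker that runs its own `worker_main` ends up with a private copy in the same way.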

If you want to reduce the footprint of large variables without using an external service, I would recommend using file-backed objects with memory mapping. This way you can efficiently use, for example, NumPy arrays.
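A minimal sketch of the memory-mapping idea, assuming NumPy is installed and that the array file is available at the same path on every node (how it gets there, for example a shared filesystem, is left open). The file name `lookup.npy` and the function `lookup_partition` are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Write the large array to disk once. Here it is a small stand-in array
# in a temporary directory; the path is an assumption for this sketch.
path = os.path.join(tempfile.mkdtemp(), "lookup.npy")
np.save(path, np.arange(1000000, dtype=np.float64))

def lookup_partition(indices, path=path):
    # mmap_mode="r" maps the file read-only instead of reading it into
    # this process's heap, so processes on one host can share the pages
    # through the operating system's page cache.
    table = np.load(path, mmap_mode="r")
    for i in indices:
        yield float(table[i])

view = np.load(path, mmap_mode="r")
print(type(view).__name__, list(lookup_partition([0, 5, 999999])))
```

In a PySpark job a function of this shape could be passed to `rdd.mapPartitions` in place of reading a broadcast variable, so that each Python worker maps the file rather than deserializing its own full copy.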

In contrast, native (JVM) Spark applications do share broadcast variables between multiple executor threads within a single executor JVM.

zero323 answered Sep 20 '22 14:09
