I have a node with 24 cores and 124 GB of RAM in my Spark cluster. If I set spark.executor.memory to 4g and then broadcast a variable that takes 3.5 GB of RAM to store, will the cores collectively hold 24 copies of that variable, or one copy?
I am using PySpark v1.6.2.
Broadcast variables are used to keep a copy of data on all nodes. The variable is cached on each machine rather than shipped to the machines with every task.
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums. This guide shows each of these features in each of Spark's supported languages.
A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
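To make that concrete, here is a minimal PySpark sketch of the broadcast API described above; the data and names are illustrative only:

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-example")

# A large read-only structure we want available on every node once,
# instead of serializing it into every task closure.
lookup_table = {i: i * i for i in range(1000)}
bc_lookup = sc.broadcast(lookup_table)

# Worker-side tasks read the cached copy through .value.
result = sc.parallelize(range(10)).map(lambda x: bc_lookup.value[x]).collect()
print(result)  # [0, 1, 4, 9, ...]

bc_lookup.unpersist()  # release the cached copies when no longer needed
sc.stop()
```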
I believe that PySpark doesn't use any form of shared memory to share broadcast variables between the workers.
On Unix-like systems, broadcast variables are loaded in the main function of the worker, which is called only after forking from the daemon, so they are not accessible from the parent process space.
If you want to reduce the footprint of large variables without using an external service, I would recommend using file-backed objects with memory mapping. That way you can, for example, use NumPy arrays efficiently.
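As a rough sketch of that file-backed approach (the path, shapes, and partition counts are assumptions, and the file must live somewhere every node can read, e.g. shared storage):

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="mmap-example")

# The driver writes the large array once to a location visible from all nodes.
path = "/shared/big_array.npy"  # hypothetical shared path
np.save(path, np.random.rand(100000, 100))

def process_partition(rows):
    # mmap_mode="r" maps the file lazily; worker processes on the same node
    # end up sharing the physical pages through the OS page cache instead of
    # each holding a full private in-memory copy.
    big = np.load(path, mmap_mode="r")
    for i in rows:
        yield float(big[i].sum())

print(sc.parallelize(range(10), 2).mapPartitions(process_partition).collect())
sc.stop()
```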
In contrast, native (JVM) Spark applications do share broadcast variables between the multiple executor threads within a single executor JVM.