
PySpark broadcast variables from local functions

I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers.

Let's say I have this setup:

def main():
    sc = SparkContext()
    SomeMethod(sc)

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value  ### NameError: global name 'V' is not defined ###

However, if I instead eliminate the SomeMethod() middleman, it works fine.

def main():
    sc = SparkContext()
    someValue = rand()
    V = sc.broadcast(someValue)
    A = sc.parallelize().map(worker)

def worker(element):
    element *= V.value   # works just fine

I'd rather not have to put all my Spark logic in the main method, if I can. Is there any way to broadcast variables from within local functions and have them be globally visible to the Spark workers?

Alternatively, what would be a good design pattern for this kind of situation--e.g., I want to write a method specifically for Spark which is self-contained and performs a specific function I'd like to re-use?

asked Nov 16 '14 by Magsol

People also ask

How do you create a broadcast variable in PySpark?

A PySpark broadcast variable is created with the broadcast(v) method of the SparkContext class, which takes the value v to be broadcast.
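
For instance, a minimal sketch (names and values are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "broadcast-example")

# broadcast a small lookup table to every executor
states = {"NY": "New York", "CA": "California"}
broadcast_states = sc.broadcast(states)

print(broadcast_states.value["NY"])   # "New York"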

How do you pass a broadcast variable to UDF in PySpark?

Once a broadcast variable is created, it can be referenced inside a UDF or directly within transformations, and it can even be used as part of a join. It does not need to be passed to the UDF as a parameter; the UDF can refer to it directly.
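
A sketch of that pattern (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-udf").getOrCreate()

states = {"NY": "New York", "CA": "California"}
broadcast_states = spark.sparkContext.broadcast(states)

# the UDF refers to the broadcast variable directly; it is not passed in as a column
@udf(returnType=StringType())
def state_name(code):
    return broadcast_states.value.get(code, "unknown")

df = spark.createDataFrame([("NY",), ("CA",)], ["code"])
df.withColumn("name", state_name("code")).show()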

How do you broadcast variables?

With the Broadcast<T> class. A broadcast variable allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Broadcast variables can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
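
For example, the classic lookup-table pattern (a small hypothetical dict standing in for a large dataset):

from pyspark import SparkContext

sc = SparkContext("local", "broadcast-lookup")

# imagine this dict is large; broadcasting ships it to each executor once,
# instead of serializing it with every task
country_codes = {"us": "United States", "de": "Germany", "jp": "Japan"}
bc_codes = sc.broadcast(country_codes)

rdd = sc.parallelize(["us", "jp", "de"])
print(rdd.map(lambda code: bc_codes.value.get(code, "unknown")).collect())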

What is the difference between broadcast variable and accumulator in Spark?

Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
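
A short sketch contrasting the two (values are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "shared-variables")

factor = sc.broadcast(10)        # read-only on the workers
negatives = sc.accumulator(0)    # workers can only add to it

def scale(x):
    if x < 0:
        negatives.add(1)         # accumulate from within the tasks
    return x * factor.value      # read the broadcast value

print(sc.parallelize([1, -2, 3, -4]).map(scale).collect())   # [10, -20, 30, -40]
print(negatives.value)           # 2, read back on the driver after the action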


1 Answer

I am not sure I completely understood the question, but if you need the V object inside the worker function then you should definitely pass it as a parameter; otherwise the method is not really self-contained:

def worker(V, element):
    element *= V.value
    return element

Now, in order to use it in map, you need a partial so that map only sees a one-parameter function:

from functools import partial

def SomeMethod(sc):
    someValue = rand()
    V = sc.broadcast(someValue)
    # bind V positionally as the first argument; map then supplies element
    A = sc.parallelize().map(partial(worker, V))
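
An equivalent way to bind V, shown here only as an illustration (the data argument is a placeholder), is to close over it with a lambda:

def SomeMethod(sc, data):
    someValue = rand()                # rand() as in the question
    V = sc.broadcast(someValue)
    # the lambda closes over V, so map still sees a one-argument function
    return sc.parallelize(data).map(lambda element: worker(V, element))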
answered Oct 26 '22 by elyase