I am trying the following code, which adds a number to every row in an RDD and returns a list of RDDs, using PySpark.
from pyspark.context import SparkContext
file = "file:///home/sree/code/scrap/sample.txt"
sc = SparkContext('local', 'TestApp')
data = sc.textFile(file)
splits = [data.map(lambda p : int(p) + i) for i in range(4)]
print splits[0].collect()
print splits[1].collect()
print splits[2].collect()
The content in the input file (sample.txt) is:
1
2
3
I was expecting an output like this (adding the numbers in the rdd with 0, 1, 2 respectively):
[1,2,3]
[2,3,4]
[3,4,5]
whereas the actual output was :
[4, 5, 6]
[4, 5, 6]
[4, 5, 6]
which means that the comprehension used only the value 3 for variable i, irrespective of the range(4).
Why does this behavior happen?
It happens because of Python late binding and is not (Py)Spark specific. i will be looked up when lambda p: int(p) + i is used, not when it is defined. Typically that means when it is called, but in this particular context it is when it is serialized to be sent to the workers.
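The same effect is easy to reproduce without Spark at all. A minimal sketch, plain Python only with nothing PySpark-specific assumed: every lambda built in the comprehension looks up i when it is finally called, and by then the loop has already finished with i == 3.
adders = [lambda p: int(p) + i for i in range(4)]
# Each lambda reads the current value of i at call time, not the value it had
# when the lambda was created, so they all add 3.
print([add("1") for add in adders])   # [4, 4, 4, 4], not [1, 2, 3, 4]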
You can, for example, do something like this:
def f(i):
    # i is bound here, when f is called, so each _f keeps its own copy of it
    def _f(x):
        try:
            return int(x) + i
        except ValueError:
            # skip lines that cannot be parsed as integers
            return None
    return _f
data = sc.parallelize(["1", "2", "3"])
splits = [data.map(f(i)) for i in range(4)]
[rdd.collect() for rdd in splits]
## [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
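A shorter idiom that gives the same early binding, if you would rather keep a lambda, is to capture i as a default argument. This is a sketch of the same fix applied to your splits, not anything Spark-specific:
# Default values are evaluated when the lambda is defined,
# so every lambda carries its own copy of i.
splits = [data.map(lambda p, i=i: int(p) + i) for i in range(4)]
[rdd.collect() for rdd in splits]
## [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]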
This is because lambdas capture i by reference, not by value. It has nothing to do with Spark.
You can try this:
a = [(lambda y: (lambda x: y + int(x)))(i) for i in range(4)]
splits = [data.map(a[x]) for x in range(4)]
or, written inline as a single expression:
splits = [
data.map([(lambda y: (lambda x: y + int(x)))(i) for i in range(4)][x])
for x in range(4)
]
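The outer lambda is applied immediately with (i), so y is bound at definition time, which makes this equivalent to the factory function above. Assuming the same three-line sample.txt as input, collecting should now give the expected offsets:
[rdd.collect() for rdd in splits]
## [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]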