I am running into attribute lookup problems when trying to instantiate a class within my RDD.
My workflow:
1- Start with an RDD
2- Take each element of the RDD, instantiate an object for each
3- Reduce (I will write a method that will define the reduce operation later on)
Here is #2:
class test(object):
    def __init__(self, a, b):
        self.total = a + b

a = sc.parallelize([(True, False), (False, False)])
a.map(lambda (x, y): test(x, y))
Here is the error I get:
PicklingError: Can't pickle <class '__main__.test'>: attribute lookup __main__.test failed
I'd like to know if there is any way around it. Please answer with a working example that achieves the intended result (i.e. creating an RDD of objects of class "test").
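For background on why this fails: pickle serializes an instance by *reference* to its class (module name plus class name), not by copying the class's code, so a worker process that cannot import that reference raises exactly this error. A minimal, Spark-free sketch (the class name `Test` is illustrative):

```python
import pickle

# A class defined in the top-level driver script belongs to the
# '__main__' module; here we just use whatever module this file
# happens to run as.
class Test(object):
    def __init__(self, a, b):
        self.total = a + b

# pickle records instances by *reference*: the class's module and
# name are embedded in the payload, not the class's code.
payload = pickle.dumps(Test(True, False))

# The module path is literally inside the bytes; a Spark executor
# that cannot import that module cannot rebuild the instance, which
# is what "attribute lookup ... failed" reports.
assert Test.__module__.encode() in payload
```

Spark workers run in separate processes without your driver script's `__main__` namespace, which is why the same pickle round-trip that works locally breaks on executors.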
Related questions:
https://groups.google.com/forum/#!topic/edx-code/9xzRJFyQwnI
Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map()
From Davies Liu (DataBricks):
"Currently, PySpark cannot support pickling a class object defined in the current script ('__main__'); the workaround is to put the implementation of the class into a separate module, then use "bin/spark-submit --py-files xxx.py" to deploy it.
in xxx.py:
class test(object):
    def __init__(self, a, b):
        self.total = a + b
in job.py:
from xxx import test
a = sc.parallelize([(True,False),(False,False)])
a.map(lambda (x,y): test(x,y))
run it by:
bin/spark-submit --py-files xxx.py job.py
"
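To see why moving the class into a separate module resolves the error, here is a Spark-free sketch; it builds the `xxx` module at runtime purely to keep the example self-contained (in practice `xxx.py` is a real file shipped with `--py-files`):

```python
import pickle
import sys
import types

# Simulate shipping xxx.py: create a real module named 'xxx' and
# define the class inside it, instead of in the driver's '__main__'.
xxx = types.ModuleType('xxx')
exec(
    "class test(object):\n"
    "    def __init__(self, a, b):\n"
    "        self.total = a + b\n",
    xxx.__dict__,
)
sys.modules['xxx'] = xxx  # 'from xxx import test' now works anywhere

from xxx import test

# Pickling succeeds: pickle stores the importable reference
# 'xxx.test', which is exactly what a Spark worker re-imports.
roundtripped = pickle.loads(pickle.dumps(test(True, False)))
assert roundtripped.total == 1  # True + False == 1
```

The same mechanism is what `--py-files xxx.py` provides on a cluster: every executor can `import xxx`, so the pickled reference resolves.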
Just want to point out that you can pass the same argument (--py-files) to the Spark shell too.
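One caveat about the snippets above: `lambda (x, y): test(x, y)` relies on tuple-parameter unpacking, which Python 3 removed (PEP 3113), so under Python 3 you must unpack inside the lambda body instead. A sketch with a plain list standing in for the RDD:

```python
# Python 3 removed tuple parameters (PEP 3113), so
# 'lambda (x, y): test(x, y)' is a SyntaxError there.
pairs = [(True, False), (False, False)]   # stand-in for the RDD's data
mapper = lambda xy: xy[0] + xy[1]         # mirrors test.__init__'s a + b
totals = [mapper(p) for p in pairs]       # analogue of a.map(mapper)
assert totals == [1, 0]
```

With a real RDD the equivalent call would be `a.map(lambda xy: test(xy[0], xy[1]))` or `a.map(lambda xy: test(*xy))`.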