I am running into attribute lookup problems when trying to instantiate a class within my RDD.
My workflow:
1- Start with an RDD
2- Take each element of the RDD, instantiate an object for each
3- Reduce (I will write a method that will define the reduce operation later on)
Here is #2:
class test(object):
    def __init__(self, a, b):
        self.total = a + b

a = sc.parallelize([(True, False), (False, False)])
a.map(lambda (x, y): test(x, y))
Here is the error I get:
PicklingError: Can't pickle <class '__main__.test'>: attribute lookup __main__.test failed
I'd like to know if there is any way around it. Please answer with a working example that achieves the intended result (i.e. creating an RDD of objects of class "test").
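For background on why this fails: pickle serializes an instance by *reference* to its class (module name plus class name), not by copying the class's code, so a worker process that cannot import that reference raises exactly this error. A minimal, Spark-free sketch (the class name `Test` is illustrative):

```python
import pickle

# A class defined in the top-level driver script belongs to the
# '__main__' module; here we just use whatever module this file
# happens to run as.
class Test(object):
    def __init__(self, a, b):
        self.total = a + b

# pickle records instances by *reference*: the class's module and
# name are embedded in the payload, not the class's code.
payload = pickle.dumps(Test(True, False))

# The module path is literally inside the bytes; a Spark executor
# that cannot import that module cannot rebuild the instance, which
# is what "attribute lookup ... failed" reports.
assert Test.__module__.encode() in payload
```

Spark workers run in separate processes without your driver script's `__main__` namespace, which is why the same pickle round-trip that works locally breaks on executors.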
Related questions:
https://groups.google.com/forum/#!topic/edx-code/9xzRJFyQwnI
Can't pickle <type 'instancemethod'> when using python's multiprocessing Pool.map()
From Davies Liu (DataBricks):
"Currently, PySpark cannot support pickling a class object defined in the current script ('__main__'); the workaround is to put the implementation of the class into a separate module, then use "bin/spark-submit --py-files xxx.py" to deploy it.
in xxx.py:
class test(object):
    def __init__(self, a, b):
        self.total = a + b
in job.py:
from xxx import test
a = sc.parallelize([(True,False),(False,False)])
a.map(lambda (x,y): test(x,y))
run it by:
bin/spark-submit --py-files xxx.py job.py
"
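To see why moving the class into a separate module resolves the error, here is a Spark-free sketch; it builds the `xxx` module at runtime purely to keep the example self-contained (in practice `xxx.py` is a real file shipped with `--py-files`):

```python
import pickle
import sys
import types

# Simulate shipping xxx.py: create a real module named 'xxx' and
# define the class inside it, instead of in the driver's '__main__'.
xxx = types.ModuleType('xxx')
exec(
    "class test(object):\n"
    "    def __init__(self, a, b):\n"
    "        self.total = a + b\n",
    xxx.__dict__,
)
sys.modules['xxx'] = xxx  # 'from xxx import test' now works anywhere

from xxx import test

# Pickling succeeds: pickle stores the importable reference
# 'xxx.test', which is exactly what a Spark worker re-imports.
roundtripped = pickle.loads(pickle.dumps(test(True, False)))
assert roundtripped.total == 1  # True + False == 1
```

The same mechanism is what `--py-files xxx.py` provides on a cluster: every executor can `import xxx`, so the pickled reference resolves.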
Just want to point out that you can pass the same argument (--py-files) to the Spark shell too.
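One caveat about the snippets above: `lambda (x, y): test(x, y)` relies on tuple-parameter unpacking, which Python 3 removed (PEP 3113), so under Python 3 you must unpack inside the lambda body instead. A sketch with a plain list standing in for the RDD:

```python
# Python 3 removed tuple parameters (PEP 3113), so
# 'lambda (x, y): test(x, y)' is a SyntaxError there.
pairs = [(True, False), (False, False)]   # stand-in for the RDD's data
mapper = lambda xy: xy[0] + xy[1]         # mirrors test.__init__'s a + b
totals = [mapper(p) for p in pairs]       # analogue of a.map(mapper)
assert totals == [1, 0]
```

With a real RDD the equivalent call would be `a.map(lambda xy: test(xy[0], xy[1]))` or `a.map(lambda xy: test(*xy))`.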