I'm implementing a model in Spark as a Python class, and any time I try to map a class method over an RDD it fails. My actual code is more complicated, but this simplified version gets at the heart of the problem:
class model(object):
    def __init__(self):
        self.data = sc.textFile('path/to/data.csv')
        # other misc setup
    def run_model(self):
        self.data = self.data.map(self.transformation_function)
    def transformation_function(self, row):
        row = row.split(',')
        return row[0]+row[1]
Now, if I run the model like so (for example):
test = model()
test.run_model()
test.data.take(10)
I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I've played with this a bit, and it seems to occur reliably any time I try to map a class method over an RDD from within the class. I have confirmed that the mapped function works fine if I implement it outside of a class, so the problem definitely has to do with the class. Is there a way to resolve this?
The problem here is a little more subtle than using nested RDDs or performing Spark actions inside transformations. Spark doesn't allow access to the SparkContext inside an action or transformation.
Even if you don't access it explicitly, it is referenced inside the closure and has to be serialized and carried around. This means that your transformation method, which references self, keeps the SparkContext as well, hence the error.
One way to handle this is to use a static method:
class model(object):
    @staticmethod
    def transformation_function(row):
        row = row.split(',')
        return row[0]+row[1]
    def __init__(self):
        self.data = sc.textFile('some.csv')
    def run_model(self):
        self.data = self.data.map(model.transformation_function)
Edit:
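If you want to be able to access instance variables, you can try something like this, which copies the needed value into a local variable so the mapped function closes over that value rather than over the instance: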
class model(object):
    @staticmethod
    def transformation_function(a_model):
        delim = a_model.delim
        def _transformation_function(row):
            return row.split(delim)
        return _transformation_function
    def __init__(self):
        self.delim = ','
        self.data = sc.textFile('some.csv')
    def run_model(self):
        self.data = self.data.map(model.transformation_function(self))
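What the factory approach captures can be checked in plain Python as well. This is a small sketch with a hypothetical helper, make_splitter, which is not part of the answer above; it mirrors transformation_function without the class. The returned function closes over delim only, so the model instance is not among the things that would need to be serialized.

```python
def make_splitter(delim):
    """Return a function that splits a row on delim (illustrative helper)."""
    def _split(row):
        return row.split(delim)
    return _split


split = make_splitter(',')

# Inspect what the inner function closes over: only the delimiter string.
captured = [cell.cell_contents for cell in split.__closure__]
print(captured)        # [',']
print(split('a,b,c'))  # ['a', 'b', 'c']
```

The design choice is to read every instance attribute the function needs into local variables before defining the inner function, so that self never appears inside it.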