I have written a class implementing a classifier in python. I would like to use Apache Spark to parallelize classification of a huge number of datapoints using this classifier.
However, when I do the following:
import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points) #create rdd of 10 (integer, (float, float) tuples
rdd = rdd.mapValues(lambda point, bt = bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()
Spark fails with the error (just the relevant bit I think):
File "/root/spark/python/pyspark/worker.py", line 90, in main
command = pickleSer.loads(command.value)
File "/root/spark/python/pyspark/serializers.py", line 405, in loads
return cPickle.loads(obj)
ImportError: No module named BoroughTree
Can anyone help me? Somewhat desperate...
Thanks
Create Column Class Object One of the simplest ways to create a Column class object is by using PySpark lit() SQL function, this takes a literal value and returns a Column object. You can also access the Column from DataFrame by multiple ways.
PySpark uses Py4J, which is a framework that facilitates interoperation between the two languages, to exchange data between the Python and the JVM processes. When you launch a PySpark job, it starts as a Python process, which then spawns a JVM instance and runs some PySpark specific code in it.
Probably the simplest solution is to use pyFiles
argument when you create SparkContext
from pyspark import SparkContext
sc = SparkContext(master, app_name, pyFiles=['/path/to/BoTree.py'])
Every file placed there will be shipped to workers and added to PYTHONPATH
.
If you're working in an interactive mode you have to stop an existing context using sc.stop()
before you create a new one.
Also make sure that Spark worker is actually using Anaconda distribution and not a default Python interpreter. Based on your description it is most likely the problem. To set PYSPARK_PYTHON
you can use conf/spark-env.sh
files.
On a side note copying file to lib
is a rather messy solution. If you want to avoid pushing files using pyFiles
I would recommend creating either plain Python package or Conda package and a proper installation. This way you can easily keep track of what is installed, remove unnecessary packages and avoid some hard to debug problems.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With