How to use custom classes with Apache Spark (pyspark)?

I have written a class implementing a classifier in python. I would like to use Apache Spark to parallelize classification of a huge number of datapoints using this classifier.

I'm set up on Amazon EC2 with a cluster of 10 slaves, based off an AMI that comes with Python's Anaconda distribution on it. The AMI lets me use IPython Notebook remotely.
I've defined the class BoTree in a file called BoTree.py on the master in the folder /root/anaconda/lib/python2.7/, which is where all my Python modules are.
I've checked that I can import and use BoTree.py when running command-line Spark from the master (I just have to start by writing import BoTree and my class BoTree becomes available).
I've used Spark's /root/spark-ec2/copy-dir.sh script to copy the /python2.7/ directory across my cluster.
I've ssh-ed into one of the slaves and tried running ipython there, and was able to import BoTree, so I think the module has been sent across the cluster successfully (I can also see the BoTree.py file in the .../python2.7/ folder).
On the master I've checked I can pickle and unpickle a BoTree instance using cPickle, which I understand is pyspark's serializer.

However, when I do the following:

import BoTree
bo_tree = BoTree.train(data)
rdd = sc.parallelize(keyed_training_points)  # create an RDD of 10 (integer, (float, float)) tuples
rdd = rdd.mapValues(lambda point, bt = bo_tree: bt.classify(point[0], point[1]))
out = rdd.collect()

Spark fails with the error (just the relevant bit I think):

  File "/root/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/root/spark/python/pyspark/serializers.py", line 405, in loads
    return cPickle.loads(obj)
ImportError: No module named BoroughTree

Can anyone help me? Somewhat desperate...

Thanks

asked Jun 27 '15 20:06

user3279453

1 Answer

Probably the simplest solution is to use the pyFiles argument when you create the SparkContext:

from pyspark import SparkContext
sc = SparkContext(master, app_name, pyFiles=['/path/to/BoTree.py'])

Every file listed there will be shipped to the workers and added to their PYTHONPATH.
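To see why the module has to be importable on the worker side, here is a minimal sketch that uses only the standard library and no Spark at all. It assumes Python 3, and the names toy_tree and ToyTree are hypothetical stand-ins for BoTree.py and its class. It illustrates that a pickled instance carries only the module and class names, so unpickling fails with an ImportError wherever that module cannot be found:

```python
import os
import pickle
import sys
import tempfile

# Write a throwaway module to a temp directory. "toy_tree" and ToyTree are
# hypothetical stand-ins for BoTree.py and the BoTree class.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "toy_tree.py"), "w") as f:
    f.write(
        "class ToyTree(object):\n"
        "    def __init__(self, threshold):\n"
        "        self.threshold = threshold\n"
        "    def classify(self, x, y):\n"
        "        return int(x + y > self.threshold)\n"
    )

# "Driver" side: the module is importable, so pickling an instance works.
sys.path.insert(0, module_dir)
import toy_tree
payload = pickle.dumps(toy_tree.ToyTree(threshold=1.0))

# The payload refers to the module by name; it does not contain the class code.
names_module = b"toy_tree" in payload

# "Worker" side without the module: forget it and drop its directory from the path.
sys.path.remove(module_dir)
del sys.modules["toy_tree"]
try:
    pickle.loads(payload)
    failed_without_module = False
except ImportError:
    failed_without_module = True

# "Worker" side with the module back on the path, which is the situation
# that shipping the file is meant to create.
sys.path.insert(0, module_dir)
restored = pickle.loads(payload)
label = restored.classify(0.75, 0.5)
```

In this sketch the middle step fails in the same way as the traceback in the question, and the last step succeeds once the directory is back on sys.path.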

If you're working in interactive mode, you have to stop the existing context with sc.stop() before you create a new one.

Also make sure that the Spark workers are actually using the Anaconda distribution and not the default Python interpreter. Based on your description, that is most likely the problem. You can set PYSPARK_PYTHON, the environment variable that selects the Python executable, in conf/spark-env.sh.
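One way to check which interpreter the workers really use is to ask the tasks themselves. The helper below is a rough sketch, not part of Spark: worker_interpreters is a hypothetical name, and it assumes sc is an already running SparkContext. It relies only on the standard RDD methods parallelize, map, distinct and collect:

```python
def worker_interpreters(sc, num_tasks=20):
    """Return the distinct Python executables reported by Spark tasks.

    `sc` is assumed to be a live SparkContext; `num_tasks` is an arbitrary
    number of small tasks, chosen so that several executors are likely to run one.
    """
    def probe(_):
        import sys  # imported inside the task so it is resolved on the worker
        return sys.executable

    rdd = sc.parallelize(range(num_tasks), num_tasks)
    return sorted(rdd.map(probe).distinct().collect())
```

Calling worker_interpreters(sc) from the pyspark shell or a notebook returns the interpreter paths the tasks ran under. If a path outside the Anaconda installation shows up, the workers are not picking up PYSPARK_PYTHON.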

On a side note, copying files into lib is a rather messy solution. If you want to avoid pushing files with pyFiles, I would recommend creating either a plain Python package or a Conda package and installing it properly. This way you can easily keep track of what is installed, remove unnecessary packages and avoid some hard-to-debug problems.

answered Oct 04 '22 16:10

zero323

Tags: python, python-module, apache-spark, pyspark

user3279453

People also ask

1 Answers

zero323

Recent Activity

Donate For Us