How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext

sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where there is no '.py' file, how do I ship the module?
Since Python 3.3, a subset of virtualenv's features has been integrated into the standard library under the venv module. PySpark users can use virtualenv to manage Python dependencies on their clusters by packing the environment with venv-pack, in the same way conda environments are packed with conda-pack.
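A rough sketch of that workflow, following the Spark 3.x "Python Package Management" documentation (the archive name, the "environment" alias, and the python-Levenshtein package are illustrative):

# On the driver machine, prepare and pack the environment (shell steps shown as comments):
#   python -m venv pyspark_venv
#   source pyspark_venv/bin/activate
#   pip install venv-pack python-Levenshtein
#   venv-pack -o pyspark_venv.tar.gz
#
# Then ship the packed environment with the job ('spark.archives' per the
# Spark 3.x docs; on YARN the equivalent is 'spark.yarn.dist.archives').
import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_venv.tar.gz#environment").getOrCreate()

Each executor unpacks the archive under the "environment" alias and runs the worker with the bundled interpreter, so compiled dependencies travel with the job.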
Zip the contents of C:\python27 onto a USB key. Also copy all of the Python DLLs: copy C:\windows\system32\py*DLL K: (if K: is your USB drive). On the second machine, unzip the contents of the archive somewhere, then add the DLLs directly into the python27 directory.
If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile). For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
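A minimal sketch of that approach. The egg filename below is made up; the real name depends on the package version and platform, and an egg containing a C extension must be built for the same platform and Python version as the worker nodes.

# Build the egg on a machine that matches the workers:
#   python setup.py bdist_egg      # produces dist/<name>-<version>-<platform>.egg
from pyspark import SparkContext

# Ship it at context-creation time ...
sc = SparkContext("local", "App Name",
                  pyFiles=['dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg'])

# ... or add it to an already-running context.
sc.addPyFile('dist/python_Levenshtein-0.12.0-py2.7-linux-x86_64.egg')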
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
There are two main options here:
1. Build the module into a .zip/.egg and pass it to SparkContext.addPyFile.
2. pip install the module in bootstrap code for the cluster's machines.
People also suggest opening a Python shell on the cluster to test whether the module is present, as in the sketch below.
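One way to run that check from a PySpark script or shell, a sketch using the python-Levenshtein module from the question (the function name and task count are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local", "Dependency check")

def can_import_levenshtein(_):
    # Returns True on any executor where the C extension is importable.
    try:
        import Levenshtein  # provided by the python-Levenshtein package
        return True
    except ImportError:
        return False

# Run the check in a handful of tasks and collect the results on the driver.
print(sc.parallelize(range(4), 4).map(can_import_levenshtein).collect())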