
Shipping Python modules in pyspark to other nodes

How can I ship C-compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?

I know that I can ship Python files in Spark using a standalone Python script (example code below):

    from pyspark import SparkContext
    sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])

But when the module has no '.py' file to point at, as with a compiled extension, how do I ship it?

mgoldwasser asked Jul 10 '14 21:07





2 Answers

If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).

For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
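For the .zip route, the packaging step might look like the minimal sketch below. The helper names `zip_package` and `ship` and the package name `mypackage` are hypothetical, chosen only for illustration; `sc` is assumed to be an existing SparkContext. Note that this sketch fits pure-Python packages: Python's zipimport mechanism does not load compiled extension modules (.so/.pyd) directly from an archive, so a C-compiled module may still need to be installed on the nodes themselves.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package directory so archive entries start with the package name."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip bytecode caches
                full_path = os.path.join(root, name)
                # Store "mypackage/module.py", not an absolute path, so that
                # "import mypackage" resolves once the zip is on sys.path.
                zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def ship(sc, archive_path):
    """Register the archive with an existing SparkContext (sc)."""
    sc.addPyFile(archive_path)
```

With a live context this would be used as `ship(sc, zip_package("mypackage", "mypackage.zip"))`; passing the archive path in `pyFiles` when constructing the SparkContext should have the same effect.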

Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).

Josh Rosen answered Oct 06 '22 04:10



There are two main options here:

  • If it's a single file or a .zip/.egg, pass it to SparkContext.addPyFile.
  • Add a pip install step to the bootstrap script for the cluster's machines.
    • Some cloud platforms (Databricks in this case) have a UI that makes this easier.

People also suggest using the Python shell to test whether the module is present on the cluster.
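One way to run that check from the driver instead of logging in to each machine is sketched below. The helper names `probe_module` and `check_on_executors` are hypothetical, and `sc` is assumed to be an existing SparkContext. Spark decides where tasks run, so this only reports on the hosts that happened to receive a task; it is a spot check, not a guarantee that every node was covered.

```python
import importlib
import socket


def probe_module(module_name):
    """Return (hostname, importable?) for the machine this call runs on."""
    try:
        importlib.import_module(module_name)
        return (socket.gethostname(), True)
    except ImportError:
        return (socket.gethostname(), False)


def check_on_executors(sc, module_name, num_tasks=100):
    """Run the probe in many small tasks and collect the distinct results."""
    return (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(lambda _: probe_module(module_name))
        .distinct()
        .collect()
    )
```

For python-Levenshtein the importable module name is `Levenshtein`, so the call would be `check_on_executors(sc, "Levenshtein")`; any host reported with `False` is missing the module.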

ivan_pozdeev answered Oct 06 '22 04:10
