I'm having a problem with using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem, besides HDFS. Therefore I am stuck with using spark-submit --py-files
. I package the contents of site-packages in a ZIP file and submit the job like with --py-files=dependencies.zip
option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on cluster still do not seem to see the modules inside and they throw ImportError
such as this when importing numpy.
File "/path/anonymized/module.py", line 6, in <module> import numpy File "/tmp/pip-build-4fjFLQ/numpy/numpy/__init__.py", line 180, in <module> File "/tmp/pip-build-4fjFLQ/numpy/numpy/add_newdocs.py", line 13, in <module> File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/__init__.py", line 8, in <module> # File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/type_check.py", line 11, in <module> File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/__init__.py", line 14, in <module> ImportError: cannot import name multiarray
When I switch to the virtualenv and use the local pyspark shell, everything works fine, so the dependencies are all there. Does anyone know, what might cause this problem and how to fix it?
Thanks!
Spark environment provides a command to execute the application file, be it in Scala or Java(need a Jar format), Python and R programming file. The command is, $ spark-submit --master <url> <SCRIPTNAME>. py .
Spark Submit Python File Apache Spark binary comes with spark-submit.sh script file for Linux, Mac, and spark-submit. cmd command file for windows, these scripts are available at $SPARK_HOME/bin directory which is used to submit the PySpark file with . py extension (Spark with python) to the cluster.
To run Python scripts with the python command, you need to open a command-line and type in the word python , or python3 if you have both versions, followed by the path to your script, just like this: $ python3 hello.py Hello World!
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench. Cloudera Data Science Workbench provides freedom for data scientists. It gives them the flexibility to work with their favorite libraries using isolated environments with a container for each project.
First off, I'll assume that your dependencies are listed in requirements.txt
. To package and zip the dependencies, run the following at the command line:
pip install -t dependencies -r requirements.txt cd dependencies zip -r ../dependencies.zip .
Above, the cd dependencies
command is crucial to ensure that the modules are the in the top level of the zip file. Thanks to Dan Corin's post for heads up.
Next, submit the job via:
spark-submit --py-files dependencies.zip spark_job.py
The --py-files
directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH
(source of confusion for me). To add the dependencies to the PYTHONPATH
to fix the ImportError
, add the following line to the Spark job, spark_job.py
:
sc.addPyFile("dependencies.zip")
A caveat from this Cloudera post:
An assumption that anyone doing distributed computing with commodity hardware must assume is that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client’s CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually.
Although the solution above does not build an egg, the same guideline applies.
First you need to pass your files through --py-files or --files
Now in your code, add those zip/files by using the following command
sc.addPyFile("your zip/file")
Now import your zip/file in your code with an alias like the following to start referencing it
import zip/file as your-alias
Note: You need not use file extension while importing, like .py at the end
Hope this is useful.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With