I can't seem to get --py-files on Spark to work

Tags: python, apache-spark, pyspark

I'm having a problem using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all the dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem besides HDFS. Therefore I am stuck with using spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with the --py-files=dependencies.zip option (as suggested in "Easiest way to install Python dependencies on Spark executor nodes?"). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this one when importing numpy:

File "/path/anonymized/module.py", line 6, in <module>
    import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/__init__.py", line 180, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/add_newdocs.py", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/__init__.py", line 8, in <module>
    #
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/type_check.py", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/__init__.py", line 14, in <module>
ImportError: cannot import name multiarray

When I switch to the virtualenv and use the local pyspark shell, everything works fine, so the dependencies are all there. Does anyone know what might cause this problem and how to fix it?

Thanks!


asked Apr 06 '16 19:04

Andrej Palicka

2 Answers

First off, I'll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Above, the cd dependencies command is crucial to ensure that the modules are at the top level of the zip file. Thanks to Dan Corin's post for the heads-up.
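To make the "top level" requirement concrete, here is a minimal sketch that zips the contents of a directory the way the cd/zip pair above does and then lists the first path component of every archive member. The helper names (zip_directory_contents, top_level_entries) and the throwaway package mypkg are made up for illustration; only the standard library is used.

```python
import os
import tempfile
import zipfile


def zip_directory_contents(src_dir, zip_path):
    """Zip the *contents* of src_dir, like `cd src_dir && zip -r ../out.zip .`."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for fname in files:
                full = os.path.join(root, fname)
                # Store paths relative to src_dir so packages sit at the top level.
                zf.write(full, os.path.relpath(full, src_dir))


def top_level_entries(zip_path):
    """Return the sorted first path components of all members in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})


# Demo with a throwaway package standing in for the pip-installed dependencies.
workdir = tempfile.mkdtemp()
deps = os.path.join(workdir, "dependencies")
os.makedirs(os.path.join(deps, "mypkg"))
with open(os.path.join(deps, "mypkg", "__init__.py"), "w") as fh:
    fh.write("VALUE = 1\n")

zip_path = os.path.join(workdir, "dependencies.zip")
zip_directory_contents(deps, zip_path)
print(top_level_entries(zip_path))  # ['mypkg']
```

If the listing shows a single entry named dependencies instead of your package names, the archive was probably built from the parent directory and imports will not resolve.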

Next, submit the job via:

spark-submit --py-files dependencies.zip spark_job.py

The --py-files directive sends the zip file to the Spark workers but, in my case, did not add it to the PYTHONPATH (which was a source of confusion for me). To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, spark_job.py:

sc.addPyFile("dependencies.zip")
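As a rough local illustration of what "adding the zip to the PYTHONPATH" means, the sketch below builds a tiny archive and shows that a module inside it only becomes importable once the archive path is on sys.path. This is not Spark's implementation, just the underlying Python mechanism; the module name shipped_helper and its double function are hypothetical.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny archive containing one pure-Python module (hypothetical name).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "dependencies.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("shipped_helper.py", "def double(x):\n    return 2 * x\n")


def can_import(name):
    """Return True if the named module can be imported right now."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


before = can_import("shipped_helper")  # archive exists but is not on sys.path

sys.path.insert(0, zip_path)           # the step that makes the zip importable
importlib.invalidate_caches()
after = can_import("shipped_helper")

import shipped_helper

print(before, after, shipped_helper.double(21))
```

The file being present on a worker is therefore not enough by itself; the interpreter that runs your tasks also needs the archive on its module search path.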

A caveat from this Cloudera post:

An assumption that anyone doing distributed computing with commodity hardware must assume is that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client’s CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually.

Although the solution above does not build an egg, the same guideline applies.
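One concrete reason the guideline applies to zip files: Python's zip importer only loads .py and .pyc files, so compiled extension modules (.so, .pyd) inside an archive cannot be imported from it. That is a likely explanation for the "cannot import name multiarray" error in the question, since multiarray is one of NumPy's compiled modules. The sketch below scans an archive for such members; the helper name compiled_members and the fake member names are assumptions for illustration, and versioned library names such as libfoo.so.1 are not matched by this simple suffix check.

```python
import os
import tempfile
import zipfile

# File suffixes that indicate compiled code rather than pure Python.
COMPILED_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def compiled_members(zip_path):
    """Return archive members that look like compiled extension modules."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(COMPILED_SUFFIXES)
        )


# Demo archive: one pure-Python package and one placeholder "extension".
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "dependencies.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("purepkg/__init__.py", "")
    zf.writestr("fakenumpy/core/multiarray.so", b"not a real shared object")

print(compiled_members(zip_path))  # ['fakenumpy/core/multiarray.so']
```

An empty result suggests the archive is pure Python and a reasonable candidate for --py-files; any hits point at packages that probably need to be installed on the nodes themselves.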

answered Sep 22 '22 17:09

ramhiser

- First, pass your files through --py-files or --files.
  - When you pass your zip/files with these flags, your resources are transferred to a temporary directory created on HDFS just for the lifetime of that application.
- Next, in your code, add those zip/files with the following command:

  sc.addPyFile("your zip/file")

  - This loads the files into the execution environment so that code running on the executors can import them.
- Finally, import the module from your zip/file in your code, with an alias if you like, to start referencing it:

  import your_module as your_alias

Note: Do not include the file extension (such as .py) when importing.
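Put together, the steps above might look like the following sketch of a job file. It assumes a hypothetical pure-Python module mymath with a double function inside dependencies.zip, and a made-up application name; the PySpark import sits inside main so the file can be read and unit-tested on a machine without Spark. Nothing here runs until main is called, which you would do when the file is launched through spark-submit.

```python
import importlib


def double_with_helper(x):
    """Runs on an executor: import the shipped module lazily, then use it."""
    # "mymath" is a hypothetical module packaged inside dependencies.zip.
    helper = importlib.import_module("mymath")
    return helper.double(x)


def main():
    # Imported here so that merely loading this file does not require PySpark.
    from pyspark import SparkContext

    sc = SparkContext(appName="py-files-demo")  # app name is illustrative
    sc.addPyFile("dependencies.zip")            # make the archive importable on executors
    result = sc.parallelize([1, 2, 3]).map(double_with_helper).collect()
    print(result)
    sc.stop()
```

Importing the dependency inside the mapped function, rather than at the top of the file, means the import happens on the executor after the archive has been added.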

Hope this is useful.

answered Sep 21 '22 17:09

avrsanjay
