I have a similar import error on Spark executors to the one described here, just with psycopg2: ImportError: No module named numpy on spark workers
Here it says "Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors".
So the question is: how do I create an egg file from a package and its dependencies? Or a wheel, in case eggs are legacy. Is there a command for this in pip?
If you want to list all the Python packages installed in an environment, the pip list command is what you are looking for. It returns every installed package along with its version; add -v if you also want to see where each package is installed.
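For illustration, the output looks roughly like this (the packages and versions are just examples, and the exact layout differs between pip versions):

$ pip list
Package    Version
---------- -------
numpy      1.13.3
pip        9.0.1
setuptools 36.5.1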
You want to be making a wheel. Wheels are newer and more robust than eggs, and they are supported by both Python 2 and 3.
For something as popular as numpy, you don't need to bother making the wheel yourself. The project publishes wheels as part of its distribution, so you can just download one. Many Python libraries ship wheels with their releases. See here: https://pypi.python.org/pypi/numpy
If you're curious, see here how to make one in general: https://pip.pypa.io/en/stable/reference/pip_wheel/.
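If you do want pip to do the work, the pip wheel subcommand documented at that link builds (or fetches) wheels for a requirement and all of its dependencies into a single directory, which you can then ship to the workers. A rough sketch (the wheelhouse directory name is just an example):

$ pip install wheel
$ pip wheel numpy -w wheelhouse
$ ls wheelhouse/    # one .whl per package, including transitive dependencies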
Alternatively, you could just install numpy on your target workers.
EDIT:
After your comments, I think it's pertinent to mention the pipdeptree utility. If you need to see by hand what the pip dependencies are, this utility will list them for you. Here's an example:
$ pipdeptree
3to2==1.1.1
anaconda-navigator==1.2.1
ansible==2.2.1.0
  - jinja2 [required: <2.9, installed: 2.8]
    - MarkupSafe [required: Any, installed: 0.23]
  - paramiko [required: Any, installed: 2.1.1]
    - cryptography [required: >=1.1, installed: 1.4]
      - cffi [required: >=1.4.1, installed: 1.6.0]
        - pycparser [required: Any, installed: 2.14]
      - enum34 [required: Any, installed: 1.1.6]
      - idna [required: >=2.0, installed: 2.1]
      - ipaddress [required: Any, installed: 1.0.16]
      - pyasn1 [required: >=0.1.8, installed: 0.1.9]
      - setuptools [required: >=11.3, installed: 23.0.0]
      - six [required: >=1.4.1, installed: 1.10.0]
    - pyasn1 [required: >=0.1.7, installed: 0.1.9]
  - pycrypto [required: >=2.6, installed: 2.6.1]
  - PyYAML [required: Any, installed: 3.11]
  - setuptools [required: Any, installed: 23.0.0]
If you're using Pyspark and need to package your dependencies, pip can't do this for you automatically. Pyspark has its own dependency management that pip knows nothing about. The best you can do is list the dependencies and shove them over by hand, as far as I know.
Additionally, Pyspark isn't dependent on numpy or psycopg2, so pip can't possibly tell you that you'd need them if all you're telling pip is your version of Pyspark. That dependency has been introduced by you, so you're responsible for giving it to Pyspark.
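For what it's worth, a common way to shove pure-Python dependencies over by hand is to install them into a local directory, zip that up, and pass the archive to spark-submit with --py-files. The file and package names below are placeholders; note that packages with compiled extensions such as numpy or psycopg2 generally won't work this way and need to be installed on the workers instead, as described next:

$ pip install -t deps/ some_pure_python_package
$ cd deps && zip -r ../deps.zip . && cd ..
$ spark-submit --py-files deps.zip my_job.py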
As a side note, we use bootstrap scripts that install our dependencies (like numpy) before we boot our clusters. It seems to work well. That way you list the libs you need once in a script, and then you can forget about it.
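As a sketch, such a script is just a shell script the cluster runs on every node before Spark starts (the exact hook depends on your platform, e.g. an EMR bootstrap action, and the package list is only an example):

#!/bin/bash
# install the Python dependencies our Spark jobs import, on every node
sudo pip install numpy psycopg2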
HTH.
You can install wheel using pip install wheel. Then create a .whl with python setup.py bdist_wheel; you'll find it in the dist directory in the root directory of the Python package. You might also want to pass --universal if you want a single .whl file for both Python 2 and Python 3. More info on wheel.
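For example, run from the directory containing setup.py (mypackage and its version are placeholders; the py2.py3 tag only appears if you pass --universal):

$ pip install wheel
$ python setup.py bdist_wheel --universal
$ ls dist/
mypackage-0.1.0-py2.py3-none-any.whl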