I am running Spark programs on a large cluster (for which, I do not have administrative privileges). numpy
is not installed on the worker nodes. Hence, I bundled numpy
with my program, but I get the following error:
Traceback (most recent call last):
File "/home/user/spark-script.py", line 12, in <module>
import numpy
File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray
The script is actually quite simple:
from pyspark import SparkConf, SparkContext
sc = SparkContext()
sc.addPyFile('numpy.zip')
import numpy
a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()
I understand that the error occurs because numpy
dynamically loads multiarray.so
dependency and even if my numpy.zip
file includes multiarray.so
file, somehow the dynamic loading doesn't work with Apache Spark
. Why so? And how do you othewise create a standalone numpy
module with static linking?
Thanks.
There are at least two problems with your approach and both can be reduced to a simple fact that NumPy is a heavyweight dependency.
First of all Debian packages come with multiple dependencies including libgfortran
, libblas
, liblapack
and libquadmath
. So you cannot simply copy NumPy installation and expect that things will work (to be honest you shouldn't do anything like this if it wasn't the case). Theoretically you could try to build it using static linking and this way ship it with all the dependencies but it hits the second issue.
NumPy is pretty large by itself. While 20MB doesn't look particularly impressive and with all the dependencies it shouldn't be more 40MB it has to be shipped to the workers each time you start your job. The more workers you have the worse it gets. If you decide you need SciPy or SciKit it can get much worse.
Arguably this makes NumPy a really bad candidate for being shipped with pyFile
method.
If you hadn't have direct access to the workers but all the dependencies, including header files and a static library were present, you could simply try to install NumPy in the user space from the task itself (it assumes that pip
is installed as well) with something like this:
try:
import numpy as np
expect ImportError:
import pip
pip.main(["install", "--user", "numpy"])
import numpy as np
You'll find other variants of this method in How to install and import Python modules at runtime?
Since you have access to the workers a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda which can be used to package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric, it doesn't require administrative privileges and all you really need is bash and some way to fetch basic installers (wget, curl, rsync, scp).
See also: shipping python modules in pyspark to other nodes?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With