Numpy and static linking

Question

I am running Spark programs on a large cluster (for which, I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error:

Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray

The script is actually quite simple:

from pyspark import SparkConf, SparkContext
sc = SparkContext()

sc.addPyFile('numpy.zip')

import numpy

a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()

I understand that the error occurs because numpy dynamically loads multiarray.so dependency and even if my numpy.zip file includes multiarray.so file, somehow the dynamic loading doesn't work with Apache Spark. Why so? And how do you othewise create a standalone numpy module with static linking?

Thanks.

zero323 · Accepted Answer

There are at least two problems with your approach and both can be reduced to a simple fact that NumPy is a heavyweight dependency.

First of all Debian packages come with multiple dependencies including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy NumPy installation and expect that things will work (to be honest you shouldn't do anything like this if it wasn't the case). Theoretically you could try to build it using static linking and this way ship it with all the dependencies but it hits the second issue.
NumPy is pretty large by itself. While 20MB doesn't look particularly impressive and with all the dependencies it shouldn't be more 40MB it has to be shipped to the workers each time you start your job. The more workers you have the worse it gets. If you decide you need SciPy or SciKit it can get much worse.

Arguably this makes NumPy a really bad candidate for being shipped with pyFile method.

If you hadn't have direct access to the workers but all the dependencies, including header files and a static library were present, you could simply try to install NumPy in the user space from the task itself (it assumes that pip is installed as well) with something like this:

try:
    import numpy as np

expect ImportError:
    import pip
    pip.main(["install", "--user", "numpy"])
    import numpy as np

You'll find other variants of this method in How to install and import Python modules at runtime?

Since you have access to the workers a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda which can be used to package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric, it doesn't require administrative privileges and all you really need is bash and some way to fetch basic installers (wget, curl, rsync, scp).

See also: shipping python modules in pyspark to other nodes?

Numpy and static linking

Tags:

python

numpy

apache-spark

pyspark

abhinavkulkarni

1 Answers

zero323

Recent Activity

Donate For Us

Numpy and static linking

Tags:

python

numpy

apache-spark

pyspark

abhinavkulkarni

1 Answers

zero323

Related questions

Recent Activity

Donate For Us