Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numpy and static linking

I am running Spark programs on a large cluster (for which, I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program, but I get the following error:

Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray

The script is actually quite simple:

from pyspark import SparkConf, SparkContext
sc = SparkContext()

sc.addPyFile('numpy.zip')

import numpy

a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()

I understand that the error occurs because numpy dynamically loads multiarray.so dependency and even if my numpy.zip file includes multiarray.so file, somehow the dynamic loading doesn't work with Apache Spark. Why so? And how do you othewise create a standalone numpy module with static linking?

Thanks.

like image 392
abhinavkulkarni Avatar asked Dec 19 '15 23:12

abhinavkulkarni


1 Answers

There are at least two problems with your approach and both can be reduced to a simple fact that NumPy is a heavyweight dependency.

  • First of all Debian packages come with multiple dependencies including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy NumPy installation and expect that things will work (to be honest you shouldn't do anything like this if it wasn't the case). Theoretically you could try to build it using static linking and this way ship it with all the dependencies but it hits the second issue.

  • NumPy is pretty large by itself. While 20MB doesn't look particularly impressive and with all the dependencies it shouldn't be more 40MB it has to be shipped to the workers each time you start your job. The more workers you have the worse it gets. If you decide you need SciPy or SciKit it can get much worse.

Arguably this makes NumPy a really bad candidate for being shipped with pyFile method.

If you hadn't have direct access to the workers but all the dependencies, including header files and a static library were present, you could simply try to install NumPy in the user space from the task itself (it assumes that pip is installed as well) with something like this:

try:
    import numpy as np

expect ImportError:
    import pip
    pip.main(["install", "--user", "numpy"])
    import numpy as np

You'll find other variants of this method in How to install and import Python modules at runtime?

Since you have access to the workers a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda which can be used to package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric, it doesn't require administrative privileges and all you really need is bash and some way to fetch basic installers (wget, curl, rsync, scp).

See also: shipping python modules in pyspark to other nodes?

like image 97
zero323 Avatar answered Nov 15 '22 23:11

zero323