
No module named numpy when spark-submitting

I'm spark-submitting a Python file that imports numpy, but I get a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

I was thinking I would pull in an egg for numpy with --py-files, but I'm having trouble figuring out how to build that egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.

Any idea on the appropriate thing to do here?

JnBrymn asked Apr 04 '15 17:04


2 Answers

It looks like Spark is using a version of Python that does not have numpy installed. It could be because you are working inside a virtual environment.

Try this:

import os
import sys

# Tell PySpark which Python interpreter to use. Here we use the
# interpreter that is running this script. This is handy when working
# in a virtualenv, for example, because otherwise Spark would choose
# the default system Python.
# Set this before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = sys.executable
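
To see which interpreter is actually missing numpy before changing anything, a small check along these lines may help. This is only a sketch and not part of PySpark: `interpreter_has_module` is a hypothetical helper name, and it assumes Python 3.5+ for `subprocess.run`.

```python
import subprocess
import sys


def interpreter_has_module(python_path, module_name):
    """Return True if running `python_path -c "import module_name"` succeeds.

    Hypothetical helper for illustration; it only reports whether the
    given interpreter can import the module, nothing Spark-specific.
    """
    try:
        result = subprocess.run(
            [python_path, "-c", "import " + module_name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter path does not exist or is not executable.
        return False
    return result.returncode == 0
```

For example, `interpreter_has_module(sys.executable, "numpy")` reports on the Python that is calling the script, while passing the path of the system Python (or whatever `PYSPARK_PYTHON` is set to) reports on the interpreter Spark would otherwise use.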
Def_Os answered Oct 04 '22 20:10


I got this to work by installing numpy on all the EMR nodes with a small bootstrap script that contains the following (among other things):

#!/bin/bash -xe
sudo yum install python-numpy python-scipy -y

Then have the bootstrap script run when the cluster starts by adding the following option to the aws emr command (this example also passes an argument to the bootstrap script):

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This can be used when setting up a cluster automatically from DataPipeline as well.
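
If the cluster is launched from a script, the option value can be assembled programmatically. The sketch below only builds the `Path=...,Name=...,Args=[...]` string in the form shown above; `bootstrap_action` is a hypothetical helper name, and it assumes none of the values contain commas or spaces that would need quoting.

```python
def bootstrap_action(path, name, args=()):
    """Build the value for --bootstrap-actions in the shorthand form above.

    Hypothetical helper for illustration; it does no quoting or escaping,
    so values containing commas or spaces are not handled.
    """
    parts = ["Path=" + path, "Name=" + name]
    if args:
        # Arguments for the bootstrap script go in a bracketed,
        # comma-separated list.
        parts.append("Args=[" + ",".join(args) + "]")
    return ",".join(parts)
```

Called with the path, name, and single argument from the example above, it returns the same string that follows `--bootstrap-actions` there.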

Hans Peter Hagblom answered Oct 04 '22 21:10