No module named numpy when spark-submitting

Question

I’m spark-submitting a Python file that imports numpy, but I’m getting a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

I was thinking I would pull in an egg for numpy via --py-files, but I'm having trouble figuring out how to build that egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.

Any idea on the appropriate thing to do here?

Def_Os · Accepted Answer

It looks like Spark is using a version of Python that does not have numpy installed. It could be because you are working inside a virtual environment.

Try this:

import os
import sys

# Specify the Python interpreter that PySpark should use. Here we reuse
# the interpreter that is running this script.
# This is handy when working in a virtualenv, for example, because
# otherwise Spark would fall back to the default system Python.
# Set this before the SparkContext is created.
os.environ['PYSPARK_PYTHON'] = sys.executable
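
To confirm the diagnosis, it may help to check which interpreter the driver and the executors actually run, and whether numpy is importable in each. The following is a minimal sketch, assuming Python 3 and a working PySpark installation; `probe_numpy`, `report_interpreters`, and the app name are hypothetical names chosen for illustration, not part of Spark.

```python
import importlib.util
import sys


def probe_numpy(_=None):
    """Return (interpreter path, whether numpy is importable here)."""
    return sys.executable, importlib.util.find_spec("numpy") is not None


def report_interpreters():
    """Print the probe result for the driver and for the executors."""
    # Imported here so probe_numpy can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="numpy-probe")  # hypothetical app name
    try:
        print("driver:", probe_numpy())
        # Run the same probe inside tasks; distinct() collapses identical
        # results from executors that share an interpreter.
        results = sc.parallelize(range(8), 8).map(probe_numpy).distinct().collect()
        for executable, has_numpy in results:
            print("executor:", executable, "numpy importable:", has_numpy)
    finally:
        sc.stop()
```

Calling `report_interpreters()` at the end of a script passed to spark-submit should show whether the driver and executors point at the interpreter you expect. Note that the import inside pyspark.mllib happens on the driver, so a driver interpreter without numpy is enough to trigger the traceback above.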

Hans Peter Hagblom · Answer

I got this to work by installing numpy on all the EMR nodes, using a small bootstrap script that contains the following (among other things):

#!/bin/bash -xe
sudo yum install python-numpy python-scipy -y

Then configure the bootstrap script to run when you start the cluster by adding the following option to the aws emr command (this example also passes an argument to the bootstrap script):

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This also works when setting up a cluster automatically from Data Pipeline.
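
If the cluster is launched from a script, the value of that option can be assembled programmatically. This is only a sketch of the shorthand shape shown above (Path, Name, and an optional bracketed Args list); `bootstrap_action_arg` is a hypothetical helper, and it assumes the values contain no commas or brackets that would need escaping.

```python
def bootstrap_action_arg(path, name, args=()):
    """Format one value for the --bootstrap-actions option."""
    parts = ["Path=" + path, "Name=" + name]
    if args:
        # Arguments for the bootstrap script go in a bracketed list.
        parts.append("Args=[" + ",".join(args) + "]")
    return ",".join(parts)


print(bootstrap_action_arg(
    "s3://some-bucket/keylocation/bootstrap.sh",
    "setup_dependencies",
    ["s3://some-bucket"],
))
```

With the values from the example above, this prints the same string that follows --bootstrap-actions there.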

Tags:

numpy

apache-spark

pyspark

JnBrymn