 

Amazon EMR PySpark module not found

I created an Amazon EMR cluster with Spark already installed. When I ssh into my cluster and run pyspark from the terminal, it opens the PySpark shell as expected.

I uploaded a file using scp, and when I try to run it with python FileName.py, I get an import error:

from pyspark import SparkContext
ImportError: No module named pyspark

How do I fix this?

Stephen Cheng asked Aug 12 '15 22:08



2 Answers

I added the following lines to ~/.bashrc on EMR 4.3:

export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Here py4j-0.XXX-src.zip is the py4j archive in your Spark Python library folder. Look in /usr/lib/spark/python/lib/ to find the exact version and replace XXX with that version number.

Run source ~/.bashrc and you should be good.
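
If you'd rather not hard-code the py4j version, here is a minimal Python sketch (assuming the /usr/lib/spark layout above) that locates the archive with glob and extends sys.path at runtime instead of editing ~/.bashrc:

import glob
import os
import sys

# Assumes the EMR 4.x layout where Spark is installed under /usr/lib/spark.
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")

# Find the bundled py4j archive without hard-coding its version number.
py4j_archives = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if not py4j_archives:
    raise RuntimeError("no py4j archive found under %s/python/lib" % spark_home)

# Put Spark's Python sources and the py4j archive on the import path.
sys.path.insert(1, os.path.join(spark_home, "python"))
sys.path.insert(1, py4j_archives[0])

from pyspark import SparkContext  # should now import cleanly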

Bob Baxley answered Oct 16 '22 11:10


You probably need to add the PySpark directories to your Python path. I typically use a function like the following.

import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If no Python binary is specified, use the currently running one:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python

Then, you can call the function before importing pyspark:

configure_spark('/path/to/spark/home')
from pyspark import SparkContext

Spark home on an EMR node should be something like /home/hadoop/spark. See https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 for more details.
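
As a quick sanity check, a sketch like the one below (assuming the Spark home paths mentioned above, and running Spark in local mode just to verify the import) builds a SparkContext and runs a trivial job:

configure_spark('/usr/lib/spark')  # or '/home/hadoop/spark', depending on your EMR version

from pyspark import SparkContext

# Run a trivial local job to confirm PySpark is importable and functional.
sc = SparkContext("local[*]", "import-check")
print(sc.parallelize(range(10)).sum())  # prints 45 if everything is wired up
sc.stop()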

santon answered Oct 16 '22 11:10