I installed Spark, ran the sbt assembly, and can open bin/pyspark with no problem. However, I'm running into problems importing the pyspark module in IPython. I'm getting the following error:
In [1]: import pyspark
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-c15ae3402d12> in <module>()
----> 1 import pyspark

/usr/local/spark/python/pyspark/__init__.py in <module>()
     61
     62 from pyspark.conf import SparkConf
---> 63 from pyspark.context import SparkContext
     64 from pyspark.sql import SQLContext
     65 from pyspark.rdd import RDD

/usr/local/spark/python/pyspark/context.py in <module>()
     28 from pyspark.conf import SparkConf
     29 from pyspark.files import SparkFiles
---> 30 from pyspark.java_gateway import launch_gateway
     31 from pyspark.serializers import PickleSerializer, BatchedSerializer, UTF8Deserializer, \
     32     PairDeserializer, CompressedSerializer

/usr/local/spark/python/pyspark/java_gateway.py in <module>()
     24 from subprocess import Popen, PIPE
     25 from threading import Thread
---> 26 from py4j.java_gateway import java_import, JavaGateway, GatewayClient
     27
     28

ImportError: No module named py4j.java_gateway
Py4J is a Java library that is integrated with PySpark and allows Python to dynamically interface with JVM objects. Py4J is therefore a mandatory module for running a PySpark application, and it ships with Spark as the zip archive $SPARK_HOME/python/lib/py4j-*-src.zip.
java_gateway - Py4J Main API. The py4j.java_gateway module defines most of the classes that are needed to use Py4J. Py4J users are expected to use only JavaGateway explicitly and, optionally, GatewayParameters, CallbackServerParameters, java_import, get_field, get_method, launch_gateway, and is_instance_of.
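Because py4j lives inside that zip rather than being installed as a regular package, Python can only import it if the zip is on sys.path. As a minimal sketch (assuming SPARK_HOME points at /usr/local/spark as in the traceback above, and that the bundled py4j zip sits under python/lib), you can add both paths from within the interactive session itself:

import glob
import os
import sys

# Assumption: adjust the fallback if Spark is installed somewhere else
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")

# Make the pyspark package importable
sys.path.insert(0, os.path.join(spark_home, "python"))

# Make the bundled py4j importable; the glob avoids hardcoding the
# version, which differs between Spark releases
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, zip_path)

import pyspark  # py4j.java_gateway should now resolve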
In my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu), I ran into this. If you look at the pyspark shell script, you'll see that you need a few things added to your PYTHONPATH:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
That worked in ipython for me.
Update: as noted in the comments, the name of the py4j zip file changes with each Spark release, so check $SPARK_HOME/python/lib for the exact filename and adjust the export accordingly.
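If you want to confirm the exact filename on your machine before hardcoding it into the export, a quick check from Python works; this is a sketch that assumes the standard layout under $SPARK_HOME/python/lib:

import glob
import os

# Prints e.g. .../python/lib/py4j-0.8.2.1-src.zip for Spark 1.1.x
spark_home = os.environ.get("SPARK_HOME", "/usr/local/spark")
for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    print(zip_path)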