 

Why can't PySpark find py4j.java_gateway?

I installed Spark, ran the sbt assembly, and can open bin/pyspark with no problem. However, I am running into problems loading the pyspark module into ipython. I'm getting the following error:

In [1]: import pyspark
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-c15ae3402d12> in <module>()
----> 1 import pyspark

/usr/local/spark/python/pyspark/__init__.py in <module>()
     61
     62 from pyspark.conf import SparkConf
---> 63 from pyspark.context import SparkContext
     64 from pyspark.sql import SQLContext
     65 from pyspark.rdd import RDD

/usr/local/spark/python/pyspark/context.py in <module>()
     28 from pyspark.conf import SparkConf
     29 from pyspark.files import SparkFiles
---> 30 from pyspark.java_gateway import launch_gateway
     31 from pyspark.serializers import PickleSerializer, BatchedSerializer, UTF8Deserializer, \
     32     PairDeserializer, CompressedSerializer

/usr/local/spark/python/pyspark/java_gateway.py in <module>()
     24 from subprocess import Popen, PIPE
     25 from threading import Thread
---> 26 from py4j.java_gateway import java_import, JavaGateway, GatewayClient
     27
     28
ImportError: No module named py4j.java_gateway
asked Oct 23 '14 by user592419


People also ask

Where is Py4J located?

Py4J is a Java library, bundled with PySpark, that lets Python dynamically interface with JVM objects, so Py4J is a mandatory module for running any PySpark application. It ships inside Spark at $SPARK_HOME/python/lib/py4j-*-src.zip.

What is Py4J Java_gateway?

java_gateway — Py4J Main API. The py4j.java_gateway module defines most of the classes needed to use Py4J. Py4J users are expected to use only JavaGateway explicitly and, optionally, GatewayParameters, CallbackServerParameters, java_import, get_field, get_method, launch_gateway, and is_instance_of.


1 Answer

In my environment (using Docker and the image sequenceiq/spark:1.1.0-ubuntu), I ran into this. If you look at the pyspark shell script, you'll see that you need a few things added to your PYTHONPATH:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

That worked for me in IPython.

Update: as noted in the comments, the name of the py4j zip file changes with each Spark release, so look around for the right name.
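Since the version suffix in the zip's name changes between releases, you can also glob for it from Python instead of hard-coding it in PYTHONPATH. A minimal sketch (the helper name `add_spark_to_path` is my own, and it edits `sys.path` directly rather than the PYTHONPATH environment variable, which has the same effect for the current process):

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Prepend Spark's Python sources and the bundled py4j zip to sys.path.

    Globs for py4j-*-src.zip so the code keeps working when the py4j
    version changes with a new Spark release.
    """
    paths = [os.path.join(spark_home, "python")]
    py4j_zips = glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if not py4j_zips:
        raise RuntimeError("no py4j-*-src.zip found under " + spark_home)
    paths.append(py4j_zips[0])
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)
    return paths
```

Call it with your install location, e.g. `add_spark_to_path("/usr/local/spark")`, before `import pyspark` in IPython.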

answered Oct 06 '22 by nealmcb