Error from python worker: /bin/python: No module named pyspark

Question

I am trying to establish a nice spark development environment by using ipython. First fire up ipython, then:

import findspark
findspark.init()

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
conf = SparkConf()
conf.setMaster('yarn-client')
sc = SparkContext(conf=conf)

This is from application UI, I can see that executors are up on worker nodes.

application ui

However when I try this:

rdd = sc.textFile("/LOGS/201511/*/*")
rdd.first()

I get this:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, d142.dtvhadooptest.com): org.apache.spark.SparkException:
Error from python worker:
  /bin/python: No module named pyspark
PYTHONPATH was:
  /data/sdb/hadoop/yarn/local/usercache/hdfs/filecache/64/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Can anyone help me out?

gunererd · Accepted Answer

So setting those two extra configurations did the trick.

conf.set('spark.yarn.dist.files','file:/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip,file:/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip')
conf.setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.8.2.1-src.zip')

vks2106 · Answer

In cloudera CDH

conf.set('spark.yarn.dist.files','file:/path/to/pyspark.zip,file:/path/to/py4j-0.8.2.1-src.zip')
conf.setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.8.2.1-src.zip')

The above snippet solved my issue but i didn't have the rights to change the spark application code. To solve, Check your PYTHONPATH has these two zips added. In my case the default PYTHONPATH was using hardcoded nodenames in the path to these files. With below i don't need to change the application code

export PYTHONPATH=$PYTHONPATH:/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip
export PYTHONPATH=$PYTHONPATH:/opt/cloudera/parcels/CDH/lib/spark/python/lib/pyspark.zip

Error from python worker: /bin/python: No module named pyspark

Tags:

python

ipython

ipython-notebook

apache-spark

pyspark

gunererd

2 Answers

gunererd

vks2106

Recent Activity

Donate For Us

Error from python worker: /bin/python: No module named pyspark

Tags:

python

ipython

ipython-notebook

apache-spark

pyspark

gunererd

2 Answers

gunererd

vks2106

Related questions

Recent Activity

Donate For Us