Pyspark module not found

I'm trying to run a simple PySpark job on YARN. This is the code:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
         .setMaster("yarn-client")
         .setAppName("HDFS Filter")
         .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)

inputFile = sc.textFile("hdfs://myserver:9000/1436304078054.json.gz").cache()
matchTerm = "spark"
numMatches = inputFile.filter(lambda line: matchTerm in line).count()
print(numMatches, "lines contain", matchTerm)

I don't know whether the code will work, and that is not the point. The problem is that when I run it with the command ./bin/pyspark ../job.py from inside the Spark directory, I get the following error (just a small part of the whole output):

15/09/01 17:57:02 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop-05:44841 (size: 3.8 KB, free: 534.5 MB)
15/09/01 17:57:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoop-05): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

15/09/01 17:57:02 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, hadoop-03, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop-03:33268 (size: 3.8 KB, free: 534.5 MB)
15/09/01 17:57:05 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, hadoop-03): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/21/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

15/09/01 17:57:05 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, hadoop-05, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on executor hadoop-05: org.apache.spark.SparkException (
Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException) [duplicate 1]
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, hadoop-05, RACK_LOCAL, 1475 bytes)
15/09/01 17:57:05 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on executor hadoop-05: org.apache.spark.SparkException (
Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException) [duplicate 2]
15/09/01 17:57:05 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
15/09/01 17:57:05 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/09/01 17:57:05 INFO cluster.YarnScheduler: Cancelling stage 0
15/09/01 17:57:05 INFO scheduler.DAGScheduler: ResultStage 0 (count at /home/hduser/spark-1.4.1-bin-without-hadoop/../test.py:11) failed in 5.093 s
15/09/01 17:57:05 INFO scheduler.DAGScheduler: Job 0 failed: count at /home/hduser/spark-1.4.1-bin-without-hadoop/../test.py:11, took 5.238381 s
Traceback (most recent call last):
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/../test.py", line 11, in <module>
numMatches = inputFile.filter(lambda line: matchTerm in line).count()
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 984, in count
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 975, in sum
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 852, in fold
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/pyspark.zip/pyspark/rdd.py", line 757, in collect
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hduser/spark-1.4.1-bin-without-hadoop/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop-05): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /usr/local/hadoop_store/tmp/nm-local-dir/usercache/hduser/filecache/16/spark-assembly-1.4.1-hadoop2.2.0.jar
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:130)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/09/01 17:57:06 INFO spark.SparkContext: Invoking stop() from shutdown hook

Finally, this is my spark-env.sh conf file:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

Any idea about what I'm doing wrong?

asked Sep 01 '15 16:09

David Moreno García

1 Answer

What fixed this for me was including a couple of extra settings in the SparkConf, which seem to make sure the workers get access to the PySpark and Py4J modules:

conf = (SparkConf()
     .setMaster("yarn-client")
     .setAppName("HDFS Filter")
     .set("spark.executor.memory", "1g")
     .set('spark.yarn.dist.files','file:/usr/hdp/2.3.2.0-2950/spark/python/lib/pyspark.zip,file:/usr/hdp/2.3.2.0-2950/spark/python/lib/py4j-0.8.2.1-src.zip')
     .setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.8.2.1-src.zip'))

You'll need to edit the paths as appropriate for your system.
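
If you'd rather not hard-code those paths, here is a minimal sketch that derives both settings from a Spark install directory. It assumes the layout visible in the question's traceback (python/lib/pyspark.zip plus a py4j-*-src.zip next to it); the helper names pyspark_dist_settings and build_conf are made up for illustration and are not part of Spark.

```python
import glob
import os


def pyspark_dist_settings(spark_home):
    """Return (dist_files, pythonpath) for the archives in spark_home/python/lib.

    Hypothetical helper: assumes pyspark.zip and py4j-*-src.zip live there.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    # The Py4J archive name embeds its version, so look it up with a glob.
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_zips:
        raise IOError("pyspark.zip or py4j-*-src.zip not found in " + lib_dir)
    archives = [pyspark_zip, py4j_zips[-1]]
    # Value for spark.yarn.dist.files: comma-separated file: URIs.
    dist_files = ",".join("file:" + os.path.abspath(p) for p in archives)
    # Executor PYTHONPATH: bare archive names, as in the snippet above.
    pythonpath = ":".join(os.path.basename(p) for p in archives)
    return dist_files, pythonpath


def build_conf(spark_home):
    """Build the same SparkConf as above, with paths taken from spark_home."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark import SparkConf
    dist_files, pythonpath = pyspark_dist_settings(spark_home)
    return (SparkConf()
            .setMaster("yarn-client")
            .setAppName("HDFS Filter")
            .set("spark.executor.memory", "1g")
            .set("spark.yarn.dist.files", dist_files)
            .setExecutorEnv("PYTHONPATH", pythonpath))
```

You would then call something like sc = SparkContext(conf=build_conf(os.environ["SPARK_HOME"])), assuming SPARK_HOME is set on the machine running the driver.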

answered Oct 18 '22 14:10

tobycoleman

Tags: python, apache-spark, hadoop, pyspark, hadoop-yarn
