
Spark-submit fails to import SparkContext

I'm running Spark 1.4.1 on my local Mac laptop and am able to use pyspark interactively without any issues. Spark was installed through Homebrew and I'm using Anaconda Python. However, as soon as I try to use spark-submit, I get the following error:

15/09/04 08:51:09 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: Added file file:test.py does not exist.
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1329)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1305)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:458)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:458)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
15/09/04 08:51:09 ERROR SparkContext: Error stopping SparkContext after init error.
java.lang.NullPointerException
    at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1216)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1659)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:565)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
  File "test.py", line 35, in <module>
    sc = SparkContext("local","test")
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 113, in __init__
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 165, in _do_init
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/pyspark.zip/pyspark/context.py", line 219, in _initialize_context
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: Added file file:test.py does not exist.
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1329)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1305)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:458)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:458)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Here is my code:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local","test")
    sc.parallelize([1,2,3,4])
    sc.stop()

If I move the file to anywhere in the /usr/local/Cellar/apache-spark/1.4.1/ directory, then spark-submit works fine. I have my environment variables set as follows:

export SPARK_HOME="/usr/local/Cellar/apache-spark/1.4.1"
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip

I'm sure something is set incorrectly in my environment, but I can't seem to track it down.

caleboverman asked Sep 04 '15 15:09


1 Answer

The Python files that spark-submit executes need to be on the PYTHONPATH. Either add the full path of the directory that contains the script:

export PYTHONPATH=full/path/to/dir:$PYTHONPATH

or, if you are already inside the directory that contains the Python script, add '.' to the PYTHONPATH:

export PYTHONPATH='.':$PYTHONPATH
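
If you launch jobs from a wrapper script rather than by hand, the same idea can be sketched in Python. This is only a rough illustration of the export above, not something from the Spark docs: the helper name env_with_script_dir is made up for this example, and it assumes spark-submit is on your PATH.

```python
import os


def env_with_script_dir(script_path, base_env=None):
    """Return a copy of the environment with the script's directory on PYTHONPATH.

    Hypothetical helper: it mirrors `export PYTHONPATH=/dir/of/script:$PYTHONPATH`
    without touching the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Resolve the directory that holds the script as an absolute path.
    script_dir = os.path.dirname(os.path.abspath(script_path))
    # Split the existing PYTHONPATH, dropping empty entries.
    parts = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    # Prepend the script directory unless it is already listed.
    if script_dir not in parts:
        parts.insert(0, script_dir)
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env
```

You could then pass the result to the launcher, for example subprocess.call(["spark-submit", "test.py"], env=env_with_script_dir("test.py")). Entries already on PYTHONPATH (such as the pyspark and py4j paths from the question) are kept after the script directory.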

Thanks to @Def_Os for pointing that out!

Pieter answered Oct 31 '22 19:10