I am running a Jupyter notebook on a Spark cluster (with YARN). I am using the "findspark" package to set up the notebook, and it works perfectly fine (I connect to the cluster master through an SSH tunnel). When I write a "self-contained" notebook, everything works; e.g. the following code runs with no problem:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000, numSlices=10)
a.take(10)
sc.stop()
The Spark job is perfectly distributed across the workers. However, when I want to use a Python package that I wrote, the files are missing on the workers.
When I am not using a Jupyter notebook and instead use spark-submit --master yarn --py-files myPackageSrcFiles.zip, my Spark job works fine; e.g. the following code runs correctly:
main.py
import pyspark
from myPackage import myFunc
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000, numSlices=10)
b = a.map(lambda x: myFunc(x))
b.take(10)
sc.stop()
Then
spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py
The question is: how do I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword, but I got an error...
I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error
The keyword argument is camel case, pyFiles (not pyfiles):
sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])
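Putting it together, a minimal notebook sketch would look like the following. It assumes myPackageSrcFiles.zip sits in the notebook's working directory (otherwise pass an absolute path) and that the zip contains the myPackage directory at its top level, as in your spark-submit example:

import findspark
findspark.init()

import pyspark

# pyFiles ships the zip to every executor when the context is created
sc = pyspark.SparkContext(appName='myApp', pyFiles=['myPackageSrcFiles.zip'])

from myPackage import myFunc

a = sc.range(1000, numSlices=10)
b = a.map(lambda x: myFunc(x))
print(b.take(10))

sc.stop()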
Or you can use addPyFile after the context has been created:
sc.addPyFile("myPackageSrcFiles.zip")
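With this approach the context is created first and the zip is distributed afterwards, so call addPyFile before the first action that uses your package. A minimal sketch, under the same assumptions about the zip's location and layout:

sc = pyspark.SparkContext(appName='myApp')
sc.addPyFile('myPackageSrcFiles.zip')  # distribute the zip to the executors

from myPackage import myFunc  # import only after addPyFile

print(sc.range(1000, numSlices=10).map(myFunc).take(10))
sc.stop()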