Specifying python files for jupyter notebook on a Spark cluster

I am running a Jupyter notebook on a Spark cluster (with YARN). I use the "findspark" package to set up the notebook, and it works fine (I connect to the cluster master through an SSH tunnel). A "self-contained" notebook works perfectly; for example, the following code runs with no problem:

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
a.take(10)
sc.stop()

The Spark job is distributed across the workers as expected. However, when I use a Python package that I wrote, its files are missing on the workers.

Outside Jupyter, when I use spark-submit --master yarn --py-files myPackageSrcFiles.zip, my Spark job works fine; for example, the following code runs correctly:

main.py

import pyspark
from myPackage import myFunc

sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
b = a.map(lambda x: myFunc(x)) 
b.take(10)
sc.stop()

Then

spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py

The question is: how can I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error...

asked Dec 08 '17 at 17:12 by ma3oun


1 Answer

I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error

The keyword is camel case (pyFiles, not pyfiles) and takes a list of paths:

sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])

Or you can call addPyFile on an existing SparkContext:

sc.addPyFile("myPackageSrcFiles.zip")
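
Both calls assume that myPackageSrcFiles.zip already exists and that the myPackage directory sits at the root of the archive, since Spark puts the zip itself on the workers' Python path. As a minimal sketch, the archive could be built from inside the notebook with the standard library. The helper name zip_package and the on-disk layout (a myPackage/ directory reachable from the notebook) are assumptions for illustration, not part of the original post:

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a package directory so that the package folder sits at the archive root.

    Hypothetical helper: package_dir is assumed to be a directory such as
    'myPackage' containing an __init__.py.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent directory,
                    # e.g. myPackage/__init__.py, so 'import myPackage' resolves.
                    zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

The returned path could then be passed as pyFiles=[zip_package("myPackage", "myPackageSrcFiles.zip")] when creating the SparkContext, or to sc.addPyFile afterwards.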
answered Sep 19 '22 at 12:09 by Alper t. Turker