I am running a Jupyter notebook on a Spark cluster (with YARN). I am using the "findspark" package to set up the notebook, and it works perfectly fine (I connect to the cluster master through an SSH tunnel). When I write a "self-contained" notebook, everything works; e.g. the following code runs with no problem:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000, numSlices=10)
a.take(10)
sc.stop()
The Spark job is perfectly distributed across the workers. However, when I want to use a Python package that I wrote, the files are missing on the workers.
When I am not using a Jupyter notebook and instead use spark-submit --master yarn --py-files myPackageSrcFiles.zip, my Spark job works fine; e.g. the following code runs correctly:
main.py
import pyspark
from myPackage import myFunc
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000, numSlices=10)
b = a.map(lambda x: myFunc(x))
b.take(10)
sc.stop()
Then
spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py
The question is: how do I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword, but I got an error...
I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error
The keyword argument is camel case, pyFiles (not pyfiles):
sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])
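Putting it together, a minimal notebook sketch would look like the following. It assumes myPackageSrcFiles.zip sits in the notebook's working directory (otherwise pass an absolute path) and that the zip contains the myPackage directory at its top level, as in your spark-submit example:

import findspark
findspark.init()

import pyspark

# pyFiles ships the zip to every executor when the context is created
sc = pyspark.SparkContext(appName='myApp', pyFiles=['myPackageSrcFiles.zip'])

from myPackage import myFunc

a = sc.range(1000, numSlices=10)
b = a.map(lambda x: myFunc(x))
print(b.take(10))

sc.stop()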
Or you can use addPyFile after the context has been created:
sc.addPyFile("myPackageSrcFiles.zip")
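With this approach the context is created first and the zip is distributed afterwards, so call addPyFile before the first action that uses your package. A minimal sketch, under the same assumptions about the zip's location and layout:

sc = pyspark.SparkContext(appName='myApp')
sc.addPyFile('myPackageSrcFiles.zip')  # distribute the zip to the executors

from myPackage import myFunc  # import only after addPyFile

print(sc.range(1000, numSlices=10).map(myFunc).take(10))
sc.stop()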