
Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip, e.g. in

gcloud beta dataproc jobs submit pyspark  --cluster clustername --py-files gcsuriofzip PY_FILE    

what should the value of PY_FILE be?

asked Sep 25 '15 by bjorndv

1 Answer

This is a good question. To answer it, I am going to use the PySpark wordcount example.

In this case, I created two files: one called test.py, which is the file I want to execute, and another called wordcount.py.zip, which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.

My test.py file looks like this:

import wordcount   # provided by wordcount.py.zip, which --py-files adds to the Python path
import sys

if __name__ == "__main__":
    # the first job argument is the input path handed to the word-count module
    wordcount.wctest(sys.argv[1])

I modified the wordcount.py file to eliminate the main method and to add a named method:

...
from pyspark import SparkContext

...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
...
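
For reference, here is a minimal sketch of what the full modified wordcount.py could look like; the wctest name matches the snippet above, while the word-count logic itself is just the standard PySpark example filled in for the elided parts:

from operator import add
from pyspark import SparkContext

def wctest(path):
    # path is the GCS URI of the input text file, passed through from test.py
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()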

I can call the whole thing on Dataproc by using the following gcloud command:

gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt

In this example, <bucket> is the name of (or path within) my bucket and <cluster-name> is the name of my Dataproc cluster.
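
For the import wordcount in test.py to resolve on the cluster, wordcount.py needs to sit at the top level of the zip. As a rough sketch (assuming the files live in the current directory and using the same placeholder bucket as above), the zip and the uploads could be produced like this:

zip wordcount.py.zip wordcount.py                  # puts wordcount.py at the root of the archive
gsutil cp wordcount.py.zip test.py gs://<bucket>/  # dependency zip and driver file
gsutil cp input.txt gs://<bucket>/input/input.txt  # sample input

Note that --py-files should also accept a comma-separated list of .py/.zip/.egg files, so a single-module dependency like this one could likely be passed as gs://<bucket>/wordcount.py directly instead of being zipped.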

answered Sep 28 '22 by James