
Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip, e.g. in

gcloud beta dataproc jobs submit pyspark  --cluster clustername --py-files gcsuriofzip PY_FILE    

what should the value of PY_FILE be?

asked Sep 25 '15 by bjorndv

1 Answer

This is a good question. To answer it, I am going to use the PySpark wordcount example.

In this case, I created two files: one called test.py, which is the file I want to execute, and another called wordcount.py.zip, which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.

My test.py file looks like this:

import wordcount   # provided by wordcount.py.zip, which --py-files adds to the Python path
import sys

if __name__ == "__main__":
    # the first job argument is the input path handed to the word-count module
    wordcount.wctest(sys.argv[1])

I modified the wordcount.py file to eliminate the main method and to add a named method:

...
from pyspark import SparkContext

...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
...
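
For reference, here is a minimal sketch of what the full modified wordcount.py could look like; the wctest name matches the snippet above, while the word-count logic itself is just the standard PySpark example filled in for the elided parts:

from operator import add
from pyspark import SparkContext

def wctest(path):
    # path is the GCS URI of the input text file, passed through from test.py
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()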

I can call the whole thing on Dataproc by using the following gcloud command:

gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt

In this example, <bucket> is the name of (or path within) my bucket and <cluster-name> is the name of my Dataproc cluster.
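
For the import wordcount in test.py to resolve on the cluster, wordcount.py needs to sit at the top level of the zip. As a rough sketch (assuming the files live in the current directory and using the same placeholder bucket as above), the zip and the uploads could be produced like this:

zip wordcount.py.zip wordcount.py                  # puts wordcount.py at the root of the archive
gsutil cp wordcount.py.zip test.py gs://<bucket>/  # dependency zip and driver file
gsutil cp input.txt gs://<bucket>/input/input.txt  # sample input

Note that --py-files should also accept a comma-separated list of .py/.zip/.egg files, so a single-module dependency like this one could likely be passed as gs://<bucket>/wordcount.py directly instead of being zipped.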

answered Sep 28 '22 by James