I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip. e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?
This is a good question. To answer it, I am going to use the PySpark wordcount example.
In this case, I created two files: one called test.py, which is the file I want to execute, and another called wordcount.py.zip, which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys

if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
...
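For reference, here is one plausible version of the full modified wordcount.py, based on the standard PySpark wordcount example. The body of wctest is my assumption, since the original only shows the signature:

from operator import add

from pyspark import SparkContext


def wctest(path):
    # Assumed body: the classic PySpark word count, wrapped in a callable method
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(path)
    counts = (lines.flatMap(lambda line: line.split(' '))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print('%s: %i' % (word, count))
    sc.stop()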
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example, <bucket> is the name of (or path within) my bucket and <cluster-name> is the name of my Dataproc cluster. To answer the original question directly: the PY_FILE value is the GCS URI of the main driver file, here gs://<bucket>/test.py, while the zip passed via --py-files supplies the wordcount module it imports.
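For completeness, one way to package and stage both files before submitting, assuming the standard zip and gsutil tools and the same placeholder bucket as above:

# zip the module so --py-files can distribute it to the cluster
zip wordcount.py.zip wordcount.py
# copy the zip and the driver file to GCS
gsutil cp wordcount.py.zip test.py gs://<bucket>/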