How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have completed?
Dataproc provides cluster creation, monitoring, and management, but I cannot find a way to have the cluster delete itself once all jobs are done.
The gcloud dataproc CLI offers the --max-idle option. This automatically deletes the Dataproc cluster after a given period of inactivity (i.e. no running jobs). It can be used as follows:
gcloud dataproc clusters create test-cluster \
--project my-test-project \
--zone europe-west1-b \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 100 \
--max-idle 1h
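If the cluster already exists, the same setting can also be applied to it afterwards; assuming a reasonably recent gcloud release, something like the following should work:
gcloud dataproc clusters update test-cluster \
--project my-test-project \
--max-idle 1h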
It depends on the language. Personally, I use Python (PySpark), and the code provided here worked fine for me:
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/dataproc/submit_job_to_cluster.py
You may need to adapt the code to your purpose and follow the prerequisite steps specified in the README file (https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/dataproc), such as enabling the API and installing the packages listed in requirements.txt.
Basically, with the function wait_for_job you wait until the job has finished, and with delete_cluster, as the name says, the cluster you previously created gets deleted.
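As a rough illustration, here is a minimal sketch of that submit-wait-delete flow, written against the newer google-cloud-dataproc client library rather than the googleapiclient style used in the linked sample; the project, region, cluster name, and GCS path below are placeholders you would replace with your own:
from google.cloud import dataproc_v1

project_id = "my-test-project"   # placeholder project
region = "europe-west1"          # placeholder region
cluster_name = "test-cluster"    # placeholder cluster name

# Regional endpoint for the Dataproc API.
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)

# Submit a PySpark job and block until it finishes, the equivalent
# of the sample's wait_for_job polling loop.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/my_job.py"},  # placeholder script
}
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
operation.result()  # raises if the job fails

# The job is done, so tear the cluster down, the equivalent of the
# sample's delete_cluster; .result() waits for the deletion to finish.
cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
    }
).result()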
I hope this can help you.