I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of a cron job on DataProc or some documentation on using DataProc together with Scheduler.
Thanks in advance!
For this, you can use Cloud Scheduler to create a new job. Cron jobs are scheduled at recurring intervals specified using the Unix cron format, so you can define a schedule that runs your job multiple times a day or only on specific days and months. For example, the expression 0 9 7 * * runs a job at 09:00 on the 7th of every month.
What type of jobs can I run? Dataproc provides out-of-the-box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.
Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them.
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster while also handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is that Workflows are fire-and-forget and any clusters created will also be deleted on completion.
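For reference, assuming you have the google-cloud-dataproc Python client library installed, a minimal sketch of instantiating such a template programmatically looks like this; the project, region, and template names below are placeholders that mirror the example used later in this answer:

    # Minimal sketch: instantiate an existing Dataproc workflow template.
    # Assumes the google-cloud-dataproc client library and default credentials.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient()

    # Fully qualified template name; same form as the REST URL used below.
    name = "projects/example/regions/global/workflowTemplates/terasort-example"

    # Returns a long-running operation; the workflow creates the cluster,
    # runs its jobs, and deletes the cluster when done.
    operation = client.instantiate_workflow_template(name=name)
    operation.result()  # optional: block until the workflow completes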
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule would be to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in PubSub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.

Assuming your use case is the simple "run a workflow every so often with the same parameters", I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, called [email protected], and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: [email protected]
Scope: (left blank)
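If you would rather create the same Scheduler job from code instead of the console, a rough sketch with the google-cloud-scheduler Python client could look like the following; the project, location, job name, and service account email are placeholders, and the schedule assumes 09:00 UTC on the 7th of each month:

    # Sketch: create a Cloud Scheduler job that instantiates the workflow template.
    # Assumes the google-cloud-scheduler client library and default credentials.
    from google.cloud import scheduler_v1

    client = scheduler_v1.CloudSchedulerClient()
    parent = "projects/example/locations/us-central1"  # placeholder project/location

    job = {
        "name": f"{parent}/jobs/monthly-terasort",
        "schedule": "0 9 7 * *",  # 09:00 on the 7th of every month
        "time_zone": "Etc/UTC",
        "http_target": {
            "uri": (
                "https://dataproc.googleapis.com/v1/projects/example/regions/"
                "global/workflowTemplates/terasort-example:instantiate?alt=json"
            ),
            "http_method": scheduler_v1.HttpMethod.POST,
            "body": b"{}",
            # Placeholder email; use the Service Account you granted the Dataproc role to.
            "oauth_token": {"service_account_email": "workflow@example.iam.gserviceaccount.com"},
        },
    }

    client.create_job(parent=parent, job=job)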
You can test it by clicking Run Now.
Note that you can also copy the entire workflow content into the Body as a JSON payload. In that case, the last part of the URL becomes workflowTemplates:instantiateInline?alt=json.
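If you go the instantiateInline route but would rather use the client library than raw HTTP, here is a hedged sketch with an ephemeral managed cluster and a single PySpark step; the cluster sizing, zone, and GCS path are illustrative placeholders rather than values from the question:

    # Sketch: inline workflow template with an ephemeral cluster and a PySpark step.
    # Assumes the google-cloud-dataproc client library; all names, paths, and
    # machine sizes below are placeholders.
    from google.cloud import dataproc_v1

    client = dataproc_v1.WorkflowTemplateServiceClient()
    parent = "projects/example/regions/global"  # placeholder project/region

    template = {
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-pip-cluster",
                "config": {
                    "gce_cluster_config": {"zone_uri": "us-central1-a"},
                    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                    "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-4"},
                },
            }
        },
        "jobs": [
            {
                "step_id": "point-in-polygon",
                # Placeholder path to the PySpark script in GCS.
                "pyspark_job": {"main_python_file_uri": "gs://example-bucket/point_in_polygon.py"},
            }
        ],
    }

    operation = client.instantiate_inline_workflow_template(parent=parent, template=template)
    operation.result()  # the managed cluster is deleted automatically afterwards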
Check out this official doc that discusses other scheduling options.