How can I run two parallel jobs on Google Dataproc?

1 Answer

Dataproc does support multiple concurrent jobs. However, how many jobs it can host at once depends on YARN having spare capacity. If there is no room for a job's Application Master, the job is queued. If the Application Master fits but few containers are left for the actual workers, the job runs but takes a long time.

The number of containers that your larger job requests depends on its number of partitions. With default settings, a Dataproc worker supports 2 Mapper or Reducer tasks. If you're processing 100 files and each file is a partition, the job asks for 100 containers, which is more than a small cluster can run at once, so your entire cluster capacity is allocated to that one job.
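To make the arithmetic above concrete, here is a minimal sketch. The figure of 2 tasks per worker comes from the answer; the cluster size of 4 workers is a hypothetical value chosen only for illustration, and the function names are made up for this example.

```python
import math


def concurrent_task_slots(num_workers: int, tasks_per_worker: int = 2) -> int:
    """Total tasks the cluster can run at once.

    tasks_per_worker=2 is the default quoted in the answer; your
    machine type and YARN settings may give a different number.
    """
    if num_workers < 1 or tasks_per_worker < 1:
        raise ValueError("num_workers and tasks_per_worker must be >= 1")
    return num_workers * tasks_per_worker


def waves_needed(num_partitions: int, slots: int) -> int:
    """Number of rounds of tasks needed to process every partition."""
    if slots < 1:
        raise ValueError("slots must be >= 1")
    return math.ceil(num_partitions / slots)


if __name__ == "__main__":
    workers = 4        # hypothetical cluster size
    partitions = 100   # one partition per input file, as in the answer
    slots = concurrent_task_slots(workers)
    print(f"{slots} slots, {waves_needed(partitions, slots)} waves of tasks")
```

With these assumed numbers the large job keeps every slot busy for many waves, which is why a second job submitted in the meantime finds little or no free capacity.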

There are a few things you could do:

Run smaller jobs on a separate cluster. The ideal cluster configuration is one job occupying the entire cluster, or N jobs sharing the cluster evenly.
Add extra workers to your current cluster and/or experiment with preemptible workers (you can resize the cluster with the gcloud dataproc clusters update command).
(Advanced) Experiment with different YARN schedulers, for example the Fair Scheduler with queues.
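The options above could be scripted roughly as follows. This is only a sketch: it builds gcloud command lines without running them, and the cluster names, region, bucket paths and worker counts are all hypothetical placeholders. It also assumes that the Fair Scheduler is selected at cluster creation through the yarn: property prefix; queue definitions are not shown.

```python
import shlex

# YARN's Fair Scheduler class, set through a cluster property.
FAIR_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def resize_cmd(cluster, region, num_workers, num_secondary_workers=None):
    """Command to resize an existing cluster (option 2)."""
    cmd = ["gcloud", "dataproc", "clusters", "update", cluster,
           f"--region={region}", f"--num-workers={num_workers}"]
    if num_secondary_workers is not None:
        # Secondary workers are the preemptible kind by default.
        cmd.append(f"--num-secondary-workers={num_secondary_workers}")
    return cmd


def submit_pyspark_cmd(cluster, region, main_file):
    """Command to submit a PySpark job without waiting for it to finish."""
    return ["gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
            f"--cluster={cluster}", f"--region={region}", "--async"]


def create_fair_cluster_cmd(cluster, region):
    """Command to create a cluster that uses the Fair Scheduler (option 3)."""
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            f"--region={region}",
            f"--properties=yarn:yarn.resourcemanager.scheduler.class={FAIR_SCHEDULER}"]


if __name__ == "__main__":
    region = "us-central1"  # hypothetical region
    commands = [
        # Option 1: the large and the small job each get their own cluster.
        submit_pyspark_cmd("big-cluster", region, "gs://my-bucket/large_job.py"),
        submit_pyspark_cmd("small-cluster", region, "gs://my-bucket/small_job.py"),
        # Option 2: grow the shared cluster instead.
        resize_cmd("big-cluster", region, 10, num_secondary_workers=4),
        # Option 3: a cluster whose YARN uses the Fair Scheduler.
        create_fair_cluster_cmd("fair-cluster", region),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before pasting them into a shell or passing them to subprocess.run.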

answered Nov 03 '22 06:11

tix

Tags:

google-cloud-platform

google-cloud-dataproc

fbexiga
