I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs.
From what I could gather from the Google Dataproc documentation, the platform is supposed to support multiple concurrent jobs, since it uses YARN dynamic allocation for resources.
However, when I try to launch multiple jobs, they get queued and one doesn't start until the cluster is free.
All settings are at their defaults. How can I enable multiple jobs to run at the same time?
Dataproc indeed supports multiple concurrent jobs. However, its ability to host multiple jobs depends on YARN having capacity available for each job's Application Master (otherwise the job is queued) and for its actual workers (otherwise the job takes a long time).
The number of containers your larger job requests depends on its number of partitions. With default settings, a Dataproc worker supports 2 mapper or reducer tasks. If you're processing 100 files and each file is a partition, your entire cluster capacity is now allocated.
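One practical way to keep the large job from claiming every container is to cap how many executors it may request at submit time. This is only a sketch: the cluster name, jar, class, and executor cap below are placeholders, and it assumes Spark dynamic allocation is enabled (which it is by default on Dataproc).

    # Cap the long-running job so it cannot occupy the whole cluster.
    # "my-cluster", the jar path, class name, and the value 20 are placeholders.
    gcloud dataproc jobs submit spark \
        --cluster my-cluster \
        --class com.example.BigJob \
        --jars gs://my-bucket/big-job.jar \
        --properties spark.dynamicAllocation.maxExecutors=20

With the cap in place, YARN keeps some containers free so smaller jobs submitted afterwards can start instead of queuing.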
There are a few things you could do:
Run smaller jobs on a separate cluster. Your ideal cluster configuration is one job occupying the entire cluster, or N jobs evenly sharing the cluster.
Add extra workers to your current cluster and/or experiment with preemptible workers (you can use the clusters update command for resizing; example commands are sketched after this list).
(Advanced) Experiment with different YARN schedulers (for example, the Fair Scheduler with queues; a minimal configuration sketch also follows below).
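For the first two suggestions, the commands might look like the following. This is a sketch only: cluster names, worker counts, and machine types are placeholders, and exact flag names may vary with your gcloud version.

    # Create a small separate cluster for the short-lived jobs.
    gcloud dataproc clusters create small-jobs-cluster \
        --num-workers 2 \
        --worker-machine-type n1-standard-4

    # Resize the existing cluster, optionally adding preemptible workers for extra capacity.
    gcloud dataproc clusters update my-cluster \
        --num-workers 6 \
        --num-preemptible-workers 4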
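For the advanced option, one way to switch YARN from the default Capacity Scheduler to the Fair Scheduler on Dataproc is to set the scheduler class at cluster-creation time via --properties. This is a sketch under the assumption that you provide the queue definitions (fair-scheduler.xml) yourself, for example through an initialization action; the cluster name is a placeholder.

    # Ask YARN to use the Fair Scheduler instead of the default Capacity Scheduler.
    # Queue definitions (fair-scheduler.xml) still need to be supplied separately.
    gcloud dataproc clusters create fair-sched-cluster \
        --properties 'yarn:yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'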