
How can I run two parallel jobs on Google Dataproc?

I have one job that will take a long time to run on Dataproc. In the meantime, I need to be able to run other, smaller jobs.

From what I could gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it uses YARN dynamic allocation for resources.

However, when I try to launch multiple jobs, they get queued, and the next one doesn't start until the cluster is free.

All settings are at their defaults. How can I enable multiple jobs to run at the same time?

asked Feb 13 '17 14:02 by fbexiga




1 Answer

Dataproc does support multiple concurrent jobs. However, whether a new job actually runs depends on YARN having spare capacity: if there is no room for the job's Application Master, the job is queued; if there is room for the Application Master but not for the actual workers, the job starts but takes a long time.

The number of containers that your larger job requests depends on its number of partitions. With default settings, a Dataproc worker supports 2 Mapper or Reducer tasks. If you're processing 100 files and each file is a partition, your entire cluster capacity is now allocated to that one job.
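To make the arithmetic concrete, here is a rough sketch in Python. The figure of 2 tasks per worker comes from the default mentioned above; the 10-worker cluster size is a made-up number for illustration, and the function name is hypothetical. Real YARN capacity also depends on memory and vcore settings, so treat this as a back-of-the-envelope estimate only.

```python
import math


def estimate_capacity(num_partitions, num_workers, tasks_per_worker=2):
    """Rough estimate of how one job fills a cluster.

    tasks_per_worker=2 is the default quoted in this answer;
    num_workers is whatever your cluster has (assumed here).
    Returns (total task slots, waves needed, slots left for other jobs).
    """
    slots = num_workers * tasks_per_worker
    # Tasks run in "waves": each wave fills every available slot.
    waves = math.ceil(num_partitions / slots)
    # Slots remain free only when the job has fewer partitions than slots.
    free_slots = max(slots - num_partitions, 0)
    return slots, waves, free_slots


# Hypothetical cluster: 10 workers, job with 100 partitions (one per file).
slots, waves, free_slots = estimate_capacity(100, 10)
print(f"slots={slots} waves={waves} free_slots={free_slots}")
```

With 100 partitions and only 20 slots, the big job can keep every slot busy for several waves, which is why a second job finds nothing free.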

There are a few things you could do:

  • Run the smaller jobs on a separate cluster. The ideal cluster configuration is one where a single job occupies the entire cluster, or where N jobs share the cluster evenly.

  • Add extra workers to your current cluster and/or experiment with preemptible workers (you can resize with the gcloud dataproc clusters update command).

  • (Advanced) Experiment with different YARN schedulers (for example, the Fair Scheduler with queues).
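As a sketch of the last two options, the Python helpers below only build gcloud command lines; they don't run anything. The cluster name, region, and worker counts are placeholders, and the helper names are hypothetical. I'm assuming a recent gcloud release where preemptible workers are set with --num-secondary-workers (older releases used --num-preemptible-workers), and that the scheduler is switched at cluster creation through the yarn: property prefix. Check both against your gcloud version before relying on them.

```python
import shlex

# Standard Hadoop class name for YARN's Fair Scheduler.
FAIR_SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def resize_command(cluster, region, num_workers, num_secondary_workers=None):
    """Build a 'gcloud dataproc clusters update' command that resizes a cluster."""
    cmd = [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]
    if num_secondary_workers is not None:
        # Assumption: secondary workers are the preemptible ones by default.
        cmd.append(f"--num-secondary-workers={num_secondary_workers}")
    return cmd


def create_with_fair_scheduler_command(cluster, region):
    """Build a 'gcloud dataproc clusters create' command selecting the Fair Scheduler."""
    # The 'yarn:' prefix targets yarn-site.xml on the new cluster.
    prop = f"yarn:yarn.resourcemanager.scheduler.class={FAIR_SCHEDULER_CLASS}"
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--properties={prop}",
    ]


# Placeholder values; print the commands so they can be reviewed before running.
print(shlex.join(resize_command("my-cluster", "us-central1", 6, 4)))
print(shlex.join(create_with_fair_scheduler_command("my-fair-cluster", "us-central1")))
```

Queues for the Fair Scheduler still need their own configuration; this only selects the scheduler class.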

answered Nov 03 '22 06:11 by tix