
Running steps of EMR in parallel

I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs I trigger execute as steps, one after another in a queue.

Is there any way to make them run in parallel? If not, is there an alternative?

rahul asked Mar 30 '17 14:03




3 Answers

By default, Elastic MapReduce comes with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes cluster usage for that single job, granting it all available resources until it finishes.

Running multiple concurrent jobs on an EMR cluster (or any other YARN-based Hadoop cluster, in fact) requires a proper YARN setup with multiple queues, so that resources are granted to each job. YARN's documentation covers all of the Capacity Scheduler features well, and it is simpler than it sounds.
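As a rough illustration of the multi-queue setup described above, here is a minimal Python sketch that builds an EMR configuration entry for the `capacity-scheduler` classification. The queue name `batch` and the percentages are hypothetical and not from the original answer; the property names follow the YARN Capacity Scheduler naming scheme, but check them against the Hadoop version on your cluster.

```python
def capacity_scheduler_config(queues):
    """Build an EMR 'capacity-scheduler' configuration entry.

    queues maps a queue name to (capacity_percent, max_capacity_percent).
    Queue names and percentages are illustrative assumptions.
    """
    total = sum(capacity for capacity, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    # Declare the child queues of the root queue.
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, (capacity, max_capacity) in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(capacity)
        props[f"{prefix}.maximum-capacity"] = str(max_capacity)

    return [{"Classification": "capacity-scheduler", "Properties": props}]


# Two queues with a guaranteed 50% each, allowed to grow to 70% when idle
# capacity exists (hypothetical values).
config = capacity_scheduler_config({"default": (50, 70), "batch": (50, 70)})
```

A list like this can be supplied as the cluster's configurations at launch time; each job then has to be submitted to a specific queue to benefit from the split.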

YARN's FairScheduler is quite popular, but it uses a different approach and may be a bit more difficult to configure, depending on your needs. In the simplest scenario, where you have a single fair queue, YARN tries to grant containers to waiting jobs as soon as running jobs free them, so every job submitted to the cluster gets at least a fraction of the compute resources as soon as they are available.

ma.tome answered Oct 09 '22 22:10


If you are concerned about YARN jobs (submitted by Spark) waiting in a queue, there are several ways to run them in parallel.

By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and has a single default queue to which all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can run (not submit) in parallel depends on how many ApplicationMasters (AMs), mappers, and reducers your EMR cluster can run at once.

For example, say you have a cluster that can run at most 10 mappers in parallel (see "AWS EMR Parallel Mappers?").

Suppose you submit two map-only jobs, one after another, each requiring 10 mappers. The first job takes up all the mapper container capacity and runs, while the second waits in the queue for containers to free up. The same applies to AMs and reducers.
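The arithmetic of that example can be sketched with a toy model in Python. This is not how YARN actually schedules containers; it assumes every task takes one time unit and that, in the single-queue case, each job grabs the whole cluster before the next one starts. The function names are made up for illustration.

```python
import math


def single_queue_schedule(capacity, jobs, task_time=1):
    """Toy model: jobs run one after another, each using the whole cluster."""
    clock = 0
    schedule = {}
    for name, tasks in jobs:
        waves = math.ceil(tasks / capacity)  # rounds of tasks needed
        start = clock
        clock += waves * task_time
        schedule[name] = (start, clock)  # (start time, finish time)
    return schedule


def capped_queue_schedule(caps, jobs, task_time=1):
    """Toy model: one queue per job, each limited to its own container cap."""
    return {
        name: (0, math.ceil(tasks / caps[name]) * task_time)
        for name, tasks in jobs
    }


jobs = [("job1", 10), ("job2", 10)]
sequential = single_queue_schedule(10, jobs)
parallel = capped_queue_schedule({"job1": 5, "job2": 5}, jobs)
# sequential -> {'job1': (0, 1), 'job2': (1, 2)}
# parallel   -> {'job1': (0, 2), 'job2': (0, 2)}
```

In this model, splitting the cluster lets both jobs start immediately, but each one takes twice as long, so the total time to finish both is unchanged. Running in parallel helps with waiting time, not with overall throughput.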

To run them in parallel in spite of that limit on the number of containers the cluster supports:

  1. Keep the CapacityScheduler and create multiple queues, configuring a capacity percentage and a maximum capacity for each queue. A job in the first queue then cannot use up all the containers, even if it needs them, and you can submit your second job to the second queue, which has its own predetermined capacity.

  2. Switch to the FairScheduler by configuring yarn-site.xml. The FairScheduler lets you configure queues and share resources fairly across them. You can also use its preemption option.
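For the second option, a minimal Python sketch of the pieces involved might look like the following. It builds a `yarn-site` configuration entry that selects the FairScheduler, and a `spark-submit` argument list that targets a named queue. The queue name `batch` and the application path are hypothetical; the property names are standard YARN ones, but verify them against your Hadoop version.

```python
FAIR_SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def fair_scheduler_config(preemption=False):
    """Build an EMR 'yarn-site' configuration entry enabling the FairScheduler."""
    return [{
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.scheduler.class": FAIR_SCHEDULER_CLASS,
            "yarn.scheduler.fair.preemption": "true" if preemption else "false",
        },
    }]


def spark_submit_args(queue, app_path):
    """Build a spark-submit command that sends the job to a specific YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", queue,  # YARN queue the application is submitted to
        app_path,
    ]


yarn_site = fair_scheduler_config(preemption=True)
# Hypothetical queue name and script location.
command = spark_submit_args("batch", "s3://my-bucket/jobs/job.py")
```

The same `--queue` argument applies to the first option as well: with multiple CapacityScheduler queues, a job lands in the default queue unless it names another one.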

Note that the choice of option depends on your use case and business needs. It is important to learn about all the options and their possible impact.

https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html

jc mannem answered Oct 09 '22 20:10


Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.

Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
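As a sketch of how this might be changed on a running cluster from Python, the snippet below prepares a request for boto3's EMR `modify_cluster` call, which accepts a `StepConcurrencyLevel` parameter. The cluster ID is a placeholder, the helper names are made up for illustration, and the 1 to 256 range reflects the limit documented by AWS at the time of writing. The boto3 import is kept inside the function so the request-building part runs without AWS credentials.

```python
def step_concurrency_request(cluster_id, level):
    """Build the keyword arguments for an EMR modify_cluster call."""
    if not 1 <= level <= 256:
        raise ValueError("StepConcurrencyLevel must be between 1 and 256")
    return {"ClusterId": cluster_id, "StepConcurrencyLevel": level}


def apply_step_concurrency(cluster_id, level):
    """Send the change to EMR (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above has no dependencies

    emr = boto3.client("emr")
    return emr.modify_cluster(**step_concurrency_request(cluster_id, level))


# Placeholder cluster ID; allow up to 5 steps to run at once.
request = step_concurrency_request("j-XXXXXXXXXXXXX", 5)
```

Steps running concurrently still share the cluster's YARN resources, so the scheduler configuration continues to determine how containers are divided among them.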

Paul Codding answered Oct 09 '22 22:10