
How to submit multiple Spark jobs to a single AWS EMR cluster

I am trying to submit multiple jobs to an EMR cluster, but only the first one is in the Running state; all the rest stay in the Accepted state. Most of my jobs are streaming jobs.

I have the following questions:

  1. How can I run these jobs in parallel?
  2. What are the various ways to automate these jobs for future deployment?
  3. How can I handle scheduled jobs (like a job running once every 15 minutes)?

I am using Java for development. Any inputs will be really helpful.

Ankur Gogate asked Jul 18 '20 21:07





1 Answer

If the steps you submit to EMR do not depend on each other, you can use the EMR feature called Concurrency (step concurrency). It lets the cluster run more than one step in parallel.

This feature is available from EMR version 5.28.0 onward. If you are on an older version, you cannot use it.

When you launch the cluster from the AWS console, this setting is labeled 'Concurrency' in the UI. You can choose any number from 1 to 256.

If you launch the cluster from the AWS CLI, the setting is called 'StepConcurrencyLevel' (the --step-concurrency-level option of aws emr create-cluster).

You can read more about this in the AWS announcement that multiple steps can now run in parallel in EMR, and in the AWS CLI documentation.
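As a rough illustration of the setting described above, here is a minimal Python sketch that builds a cluster request with 'StepConcurrencyLevel' and two independent Spark steps. It assumes boto3's EMR client (run_job_flow); the cluster name, instance types, class names, S3 paths, and the helper names spark_step, cluster_request, and launch are all hypothetical placeholders, and the roles are the EMR default role names, which may differ in your account.

```python
from typing import Any, Dict, List


def spark_step(name: str, main_class: str, jar_uri: str) -> Dict[str, Any]:
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        # CONTINUE means a failure of this step does not cancel the others.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", main_class,
                jar_uri,
            ],
        },
    }


def cluster_request(steps: List[Dict[str, Any]], concurrency: int) -> Dict[str, Any]:
    """Build run_job_flow arguments; every concrete value is a placeholder."""
    if not 1 <= concurrency <= 256:
        raise ValueError("StepConcurrencyLevel must be between 1 and 256")
    return {
        "Name": "spark-jobs",                 # hypothetical cluster name
        "ReleaseLabel": "emr-5.28.0",         # first release with step concurrency
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "StepConcurrencyLevel": concurrency,
        "Steps": steps,
    }


def launch(request: Dict[str, Any], emr_client: Any = None) -> str:
    """Send the request to EMR and return the new cluster (job flow) id."""
    if emr_client is None:
        import boto3  # imported lazily so the builders work without the SDK
        emr_client = boto3.client("emr")
    return emr_client.run_job_flow(**request)["JobFlowId"]
```

Calling launch(cluster_request([spark_step(...), spark_step(...)], 2)) would start a cluster that is allowed to run both steps at once. Whether they actually run side by side still depends on the cluster having enough resources for both applications.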

To answer your third question, about how to handle scheduled jobs:

There are multiple ways to do this. One simple way is to write a Lambda function that launches the EMR cluster, and to schedule that function with an AWS CloudWatch Events rule at whatever frequency you want (say, every 15 minutes). You only need to give the rule a schedule expression (a cron or rate expression), which determines how often the rule is triggered.
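A minimal sketch of such a rule in Python, assuming boto3's CloudWatch Events client (put_rule and put_targets). The rule name, target id, and Lambda ARN are hypothetical, and schedule_expression and schedule_lambda are illustrative helper names rather than part of any AWS API.

```python
from typing import Any


def schedule_expression(minutes: int) -> str:
    """Return a rate expression such as 'rate(15 minutes)'."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    # The unit must be singular when the value is exactly 1.
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_lambda(rule_name: str, lambda_arn: str, minutes: int,
                    events_client: Any = None) -> str:
    """Create or update a scheduled rule and point it at the Lambda function."""
    if events_client is None:
        import boto3  # imported lazily so the helper above works without the SDK
        events_client = boto3.client("events")
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule_expression(minutes),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "emr-launcher" is an arbitrary, hypothetical target id.
        Targets=[{"Id": "emr-launcher", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]
```

The equivalent cron form for a 15-minute interval would be cron(0/15 * * * ? *). The function also needs a resource-based permission that allows events.amazonaws.com to invoke it, which this sketch does not set up.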

Every time the rule is triggered, it invokes your Lambda function, and the Lambda function in turn launches the EMR cluster. This is how you can schedule your jobs.
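The Lambda side could look roughly like the sketch below: a handler that launches a transient cluster, runs one Spark step, and lets the cluster shut down when the step finishes. It assumes boto3's run_job_flow; the cluster name, instance types, class name, and S3 path are hypothetical, and the extra emr_client parameter exists only so the handler can be exercised without AWS.

```python
from typing import Any, Dict


def lambda_handler(event: Dict[str, Any], context: Any,
                   emr_client: Any = None) -> Dict[str, str]:
    """Invoked by the scheduled rule; starts a cluster that runs one Spark step."""
    if emr_client is None:
        import boto3  # available by default in the AWS Lambda Python runtime
        emr_client = boto3.client("emr")

    response = emr_client.run_job_flow(
        Name="scheduled-spark-job",            # hypothetical cluster name
        ReleaseLabel="emr-5.28.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,
            # False makes the cluster terminate once its steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Steps=[{
            "Name": "batch-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--class", "com.example.BatchJob",  # hypothetical class
                    "s3://my-bucket/batch-job.jar",     # hypothetical path
                ],
            },
        }],
    )
    return {"JobFlowId": response["JobFlowId"]}
```

If a long-running cluster already exists, the same handler shape could call add_job_flow_steps with that cluster's id instead of starting a new cluster on every run.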

Ajay Kr Choudhary answered Sep 24 '22 16:09
