I have a use case where I need to run a Spark job periodically (say, every 30 minutes) on an EMR cluster. What factors should I consider when deciding between spinning up a new cluster for every run and using a long-running cluster?
What are possible strategies for scaling up the cluster if we decide on a long-running cluster?
I generally prefer independent clusters because they make it easier to debug and to spawn off test runs when needed. But you would want to do the math on how much each scenario would cost you. Adding more nodes later to an existing cluster is easy, so I wouldn't worry about that.
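To illustrate the "adding more nodes is easy" point, resizing a running EMR cluster in place can be sketched with boto3. This is only a sketch: the instance-group ID, node count, and helper names below are illustrative placeholders, not values from the post.

```python
def build_resize_request(instance_group_id: str, target_count: int) -> dict:
    """Build the arguments for EMR's ModifyInstanceGroups API call."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": target_count}
        ]
    }


def resize_cluster(instance_group_id: str, target_count: int) -> None:
    """Resize a CORE or TASK instance group in place (requires AWS credentials)."""
    import boto3  # imported here so the pure helper above works without AWS installed

    emr = boto3.client("emr")
    emr.modify_instance_groups(**build_resize_request(instance_group_id, target_count))


if __name__ == "__main__":
    # "ig-XXXXXXXX" is a placeholder instance-group ID; find yours with
    # the ListInstanceGroups API or the EMR console.
    print(build_resize_request("ig-XXXXXXXX", 5))
```

For a long-running cluster you could also let EMR do this for you with managed scaling or an auto-scaling policy on the task group, rather than calling the API yourself.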
Things you would want to consider:
The cost will be based on which EC2 instance type you select for your cluster and how many nodes you decide to have. An easy way to compute the estimates is to use AWS's cost calculator:
https://calculator.s3.amazonaws.com/index.html
For your case, it depends on how long your Spark job takes to run. You pay for the cluster in one-minute increments, so if your job only takes a few minutes to run, it will be cheaper to create a new cluster each time. The other thing to remember is that it usually takes around 10 minutes for an EMR cluster to start, and that is time you are paying for; even if your job only takes 5 minutes, you would pay for roughly 15 minutes per run.