I have a use case where I need to run a Spark job periodically (say, every 30 minutes) on an EMR cluster. What factors should I consider when deciding between spinning up a new cluster for every run and using a long-running cluster?
What are possible strategies for scaling up the cluster if we decide on a long-running cluster?
I generally prefer independent clusters because they make it easier to debug and to spawn off test runs when needed. But you would want to do the math on how much each scenario would cost you. Adding more nodes later to an existing cluster is easy, so I wouldn't worry about that.
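To illustrate the "adding more nodes is easy" point, resizing a running EMR cluster in place can be sketched with boto3. This is only a sketch: the instance-group ID, node count, and helper names below are illustrative placeholders, not values from the post.

```python
def build_resize_request(instance_group_id: str, target_count: int) -> dict:
    """Build the arguments for EMR's ModifyInstanceGroups API call."""
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": target_count}
        ]
    }


def resize_cluster(instance_group_id: str, target_count: int) -> None:
    """Resize a CORE or TASK instance group in place (requires AWS credentials)."""
    import boto3  # imported here so the pure helper above works without AWS installed

    emr = boto3.client("emr")
    emr.modify_instance_groups(**build_resize_request(instance_group_id, target_count))


if __name__ == "__main__":
    # "ig-XXXXXXXX" is a placeholder instance-group ID; find yours with
    # the ListInstanceGroups API or the EMR console.
    print(build_resize_request("ig-XXXXXXXX", 5))
```

For a long-running cluster you could also let EMR do this for you with managed scaling or an auto-scaling policy on the task group, rather than calling the API yourself.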
Things you would want to consider:
The cost will be based on which EC2 instance type you select for your cluster and how many nodes you decide to have. An easy way to compute the estimates is to use AWS's cost calculator:
https://calculator.s3.amazonaws.com/index.html
For your case, it depends on how long your Spark job takes to run. You pay for the cluster in one-minute increments, so if your job only takes a few minutes to run, it will be cheaper to create a new cluster each time. The other thing to remember is that it usually takes around 10 minutes for an EMR cluster to start, and that is time you are paying for; even if your job only takes 5 minutes, you would pay for roughly 15 minutes per run.