
How to speed up Amazon EMR bootstrap?

Tags:

amazon-emr

I'm using Amazon EMR for some intensive computation, but it takes around 7 minutes before the computation starts. Is there some clever way to have my computation start immediately? The computation is a Python streaming job launched from a user-facing website, so I can't really afford a long startup.

I might simply have missed an option in the ocean that is AWS. I just want simplicity in launching jobs (that's why I used EMR), scalability, and to pay only for what I use (and startup time is of no use to me).

nraynaud asked May 23 '12 02:05





1 Answer

I know this is an old question, but I have some insights to add for the next searcher who finds this thread hoping to speed up bootstrap times on Amazon EMR.

For a while I wondered why my clusters took so long to start, usually about 15 minutes. That is a pretty big chunk of time for a job that usually completes in under 1 hour. Sometimes it pushes the job past 1 hour, though thankfully I think AWS does not charge for the full bootstrap time.

Over the last couple of days I noticed that my startup times had improved. The Spot market became very volatile during April and the first week of May. Normally I build my cluster entirely from Spot instances, since failure is an option and the cost savings justify the technique in my case. However, after waiting 14 hours for clusters to start, I had to switch to On-Demand; I only have so much patience, and an overnight wait usually exceeds it. The On-Demand clusters start in about 5 minutes. Now that I have switched back to Spot, as the madness seems to have abated, I am back to 15 minutes per cluster.

So if you are using Spot instances for your Core or Master nodes, expect a longer startup time. I will be experimenting with a small set of On-Demand instances in the Core group, augmented with a large number of Spot instances, to see whether it helps startup and copes better with Spot market volatility.
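
The mixed layout described above could be sketched roughly as follows. This is only an illustration, not something I have benchmarked: the helper name `build_instance_groups`, the instance types, the node counts, and the bid price are all hypothetical placeholders. The dictionary keys follow the instance-group structure that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`.

```python
def build_instance_groups(core_count, task_count, spot_bid_price,
                          master_type="m5.xlarge",
                          core_type="m5.xlarge",
                          task_type="m5.xlarge"):
    """Build EMR instance groups: On-Demand master and core, Spot task nodes.

    All default instance types are illustrative assumptions, not recommendations.
    """
    return [
        {
            # The master node is On-Demand so cluster startup does not wait on the Spot market.
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": master_type,
            "InstanceCount": 1,
        },
        {
            # A small On-Demand core group keeps HDFS stable during Spot volatility.
            "Name": "core-on-demand",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": core_type,
            "InstanceCount": core_count,
        },
        {
            # A larger Spot task group adds cheap compute that can be lost without losing data.
            "Name": "task-spot",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": spot_bid_price,  # passed as a string, in USD per instance-hour
            "InstanceType": task_type,
            "InstanceCount": task_count,
        },
    ]


# Hypothetical sizing: 2 On-Demand core nodes, 10 Spot task nodes.
groups = build_instance_groups(core_count=2, task_count=10, spot_bid_price="0.10")
```

The resulting list could then be passed as `Instances={"InstanceGroups": groups, ...}` to `boto3.client("emr").run_job_flow(...)`, along with the other arguments that call requires. Whether this actually shortens startup is exactly what the experiment would need to show.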

AaronM answered Sep 20 '22 09:09
