
Running Amazon EMR with a custom AMI?

I need to run a custom C++ job as a MapReduce job on Amazon EMR, and was planning to use Hadoop Streaming for this. The C++ mapper executable relies on dozens of custom libraries, some of which are time-consuming to build.

I expected EMR to support custom AMIs (I already have one built). However, after a careful look at the documentation, it seems that EMR can only run on predefined images: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html

Am I missing something? If only predefined AMIs are indeed supported, what is the best option for getting this to run? The executable is, obviously, on S3, but can I actually bundle it up so that it depends on no shared libraries at all?

Thanks.

asked Jan 07 '14 21:01 by user2113258



1 Answer

You are correct: because of the many software tools and configurations required on a Hadoop cluster node, only Amazon-provided AMIs are allowed on EMR. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html

You can use standard bootstrapping techniques to install any additional software you require to run on your cluster. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html to learn more about bootstrap actions.
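As a rough illustration, a bootstrap action is just a script that EMR runs on every node, so it can unpack libraries you have already built elsewhere instead of compiling them on the cluster. The sketch below is only that, a sketch: the function name `install_bundle`, the tarball layout, and both paths are hypothetical, and it assumes a Python interpreter is available on the node and that the tarball has already been copied from S3 to local disk by an earlier step.

```python
#!/usr/bin/env python
"""Hypothetical bootstrap script: unpack a tarball of prebuilt libraries."""
import os
import sys
import tarfile


def install_bundle(bundle_path, prefix):
    """Extract a .tar.gz of prebuilt libraries into prefix.

    Returns the sorted paths of the shared libraries found after extraction,
    which is handy for logging what was actually installed on the node.
    """
    os.makedirs(prefix, exist_ok=True)
    with tarfile.open(bundle_path, "r:gz") as bundle:
        bundle.extractall(prefix)

    libs = []
    for root, _dirs, files in os.walk(prefix):
        for name in files:
            # Match both libfoo.so and versioned names such as libfoo.so.1
            if name.endswith(".so") or ".so." in name:
                libs.append(os.path.join(root, name))
    return sorted(libs)


# Assumed invocation: script.py <local tarball> <install prefix>
if __name__ == "__main__" and len(sys.argv) == 3:
    for lib in install_bundle(sys.argv[1], sys.argv[2]):
        print(lib)
```

The mapper would then need to find these libraries at run time, for example by passing `LD_LIBRARY_PATH` to the streaming job with `-cmdenv`; the exact directory depends on how you lay out the bundle.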

Back to your use case: why is it taking so long to bootstrap? Because there are many packages, or because you are compiling them from source?

In the latter case, it might be worth building your own .deb packages and installing them from a custom repository to speed up the bootstrap process.

If it is just because you have many packages to install, I am afraid there is no obvious solution today. I can think of EBS snapshots, with volumes created from them and attached during bootstrap, but the feasibility of this really depends on your use case.

answered Sep 25 '22 14:09 by Sébastien Stormacq