I need to run a custom C++ job as a MapReduce job on Amazon, and was planning to use Hadoop streaming for this. The C++ mapper executable relies on dozens of custom libraries, some of which are time-consuming to build.
I expected EMR to support custom AMIs (I already have one built). However, after a careful look at the documentation, it seems that it is only possible to run EMR on predefined images: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html.
Am I missing something? If, indeed, only predefined AMIs are supported, what is the best option for getting this to run? The executable, obviously, is on S3, but can I actually bundle it up so that it depends on no shared libs at all?
Thanks.
An Amazon Machine Image (AMI) is a supported and maintained image provided by AWS that provides the information required to launch an instance. You must specify an AMI when you launch an instance. You can launch multiple instances from a single AMI when you require multiple instances with the same configuration.
Q: What OS versions are supported with Amazon EMR? Amazon EMR 5.30.0 and later, and the Amazon EMR 6.x series, are based on Amazon Linux 2.
You can create an AMI using the AWS Management Console or the command line. The following diagram summarizes the process for creating an AMI from a running EC2 instance. Start with an existing AMI, launch an instance, customize it, create a new AMI from it, and finally launch an instance of your new AMI.
You are correct. Because of the many software tools and configurations required on a Hadoop cluster node, only Amazon-provided AMIs are allowed on EMR: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
You can use standard bootstrapping techniques to install any additional software you require to run on your cluster. See http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html to learn more about bootstrap actions.
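To make the bootstrap idea concrete, here is a minimal sketch of a bootstrap script written in Python. It assumes the libraries were built once elsewhere and uploaded to S3 as a single tarball, and that the `aws` CLI is available on the node. The bucket name, archive name, and install directory are hypothetical placeholders, not anything EMR prescribes.

```python
import shlex
import subprocess


def bootstrap_commands(archive_uri, install_dir,
                       tmp_path="/tmp/prebuilt-libs.tar.gz"):
    """Return the commands (as argv lists) the bootstrap step would run.

    archive_uri, install_dir and tmp_path are illustrative assumptions.
    """
    return [
        # Create the directory that will hold the shared libraries.
        ["mkdir", "-p", install_dir],
        # Download the prebuilt archive from S3 (assumes the aws CLI exists).
        ["aws", "s3", "cp", archive_uri, tmp_path],
        # Unpack the archive into the install directory.
        ["tar", "-xzf", tmp_path, "-C", install_dir],
    ]


def render(commands):
    """Render the commands as shell lines, e.g. for logging or review."""
    return "\n".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in commands
    )


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd)


# Example with hypothetical names; call run(cmds) on the node itself.
cmds = bootstrap_commands("s3://my-bucket/prebuilt-libs.tar.gz",
                          "/home/hadoop/lib")
print(render(cmds))
```

The mapper would still need to find the unpacked libraries at run time, for example through `LD_LIBRARY_PATH` or an rpath baked into the executable.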
Back to your use case: why is it taking so long to bootstrap? Is it because there are many packages, or because you're compiling them from source?
In the latter case, it might be worth building your own .deb packages and installing them from a custom repository to speed up the bootstrap process.
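As a rough sketch of the custom-repository approach, the helper below builds the APT source line and the install commands a bootstrap script could use. The repository URL, suite, and package names are hypothetical, and it assumes a Debian-based image where `apt-get` is available and the repository is signed with a key the node already trusts.

```python
def apt_source_line(repo_url, suite, component="main"):
    """Build a one-line APT source entry for a custom repository.

    The line would typically be written to a file under
    /etc/apt/sources.list.d/ (the file name is up to you).
    """
    return f"deb {repo_url} {suite} {component}"


def install_commands(packages):
    """Return argv lists that refresh the index and install the packages."""
    return [
        ["apt-get", "update"],
        ["apt-get", "install", "-y"] + list(packages),
    ]


# Example with hypothetical repository and package names.
print(apt_source_line("http://repo.example.com/debian", "stable"))
for cmd in install_commands(["libfoo1", "libbar2"]):
    print(" ".join(cmd))
```

Installing prebuilt binary packages this way moves the compilation cost to a one-time packaging step, so each node only pays for download and unpacking.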
If it is just because you have many packages to install, I am afraid there is no obvious solution today. One idea is to create EBS volumes from snapshots and attach them during bootstrap, but the feasibility of this really depends on your use case.