
How to install custom packages in an Amazon EMR bootstrap action in code?

I need to install some packages and binaries in an Amazon EMR bootstrap action, but I can't find any example that does this.

Basically, I want to install a Python package and have each Hadoop node use it to process the items in an S3 bucket. Here's a sample from boto:

from boto.emr.step import StreamingStep

step = StreamingStep(name='Image to grayscale using SimpleCV python package',
                     mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/input',
                     output='s3n://<my output bucket>/output')

I need to make it use the SimpleCV Python package, but I'm not sure where to specify this. If it isn't installed, how do I get it installed? And is there a way to avoid waiting for the installation to complete, for example by installing it somewhere once and just referencing the package?

KJW asked Apr 19 '14 10:04


1 Answer

boto provides a class for this, boto.emr.bootstrap_action.BootstrapAction.

Define the action as shown below and pass it to run_jobflow. Most of the code is from the boto example page.

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

action = BootstrapAction(name="Bootstrap to add SimpleCV",
                         path="s3n://<my bucket uri>/bootstrap-simplecv.sh",
                         bootstrap_action_args=[])  # arguments passed to the script (none here)

conn = boto.emr.connect_to_region('us-west-2')
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step],  # step defined elsewhere
                         bootstrap_actions=[action])

You also need to write the bootstrap script itself, which runs on every node when the cluster starts. If you need another version of Python then yes, it would save time to precompile it on a machine that matches the cluster nodes (same image and architecture), tar it, put it in an S3 bucket, and then untar it during the bootstrap.

#!/bin/sh
# filename: bootstrap-simplecv.sh  (save it in an S3 bucket)
set -e -x

sudo apt-get install -y python-setuptools  # -y: nobody is there to answer prompts during bootstrap
sudo easy_install pip
sudo pip install -U SimpleCV
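
For the prebuilt-archive route mentioned above (build once, tar, upload to S3, untar during bootstrap), the pack and unpack steps might look roughly like this. It is only a sketch using Python 3's standard tarfile module; the function names and paths are made up for illustration, and copying the archive to and from S3 is left out.

```python
import os
import tarfile


def pack_directory(src_dir, archive_path):
    """Create a gzip-compressed tarball of src_dir (run once, on a matching machine)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store paths relative to the directory's own name, not the full path.
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    return archive_path


def unpack_archive(archive_path, dest_dir):
    """Extract the tarball into dest_dir (run on each node during bootstrap)."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir
```

The archive is only useful if it was built on the same OS image and architecture as the cluster nodes; otherwise compiled extensions may fail to load.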

I think you can keep the EMR cluster running from within boto (run_jobflow accepts keep_alive=True), so that the bootstrap only runs once, when the cluster starts, and later steps in your session reuse the same nodes. Just be careful to terminate the cluster before you log out so you don't get a surprise on your bill.
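
One way to make the "shut it down before you log out" part harder to forget is to wrap the cluster in a context manager. This is a rough sketch, not tested against a live cluster: persistent_jobflow is a hypothetical helper, and conn is assumed to be a boto EmrConnection (it relies only on its run_jobflow and terminate_jobflow methods).

```python
from contextlib import contextmanager


@contextmanager
def persistent_jobflow(conn, **run_kwargs):
    """Start a keep-alive job flow and terminate it when the block exits."""
    # keep_alive=True stops the cluster from shutting down once its steps finish,
    # so the bootstrap actions run only once, at startup.
    jobflow_id = conn.run_jobflow(keep_alive=True, **run_kwargs)
    try:
        yield jobflow_id
    finally:
        # Runs even if the block raises, so the cluster is not left billing.
        conn.terminate_jobflow(jobflow_id)


# Hypothetical usage, with conn, step and action defined as in the answer:
# with persistent_jobflow(conn, name='My jobflow', steps=[step],
#                         bootstrap_actions=[action]) as jobid:
#     conn.add_jobflow_steps(jobid, [another_step])
#     ...  # wait for the steps to finish before leaving the block
```

Leaving the block terminates the cluster immediately, so any waiting for steps to complete has to happen inside it.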

slushy answered Nov 01 '22 12:11