 

How to launch and configure an EMR cluster using boto


I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job flows, but I can't for the life of me find an example that shows:

  1. How to define the cluster to be used (by cluster_id)
  2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?

asked Oct 11 '14 at 11:10 by eran



1 Answer

Boto and the underlying EMR API currently mix the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.

You create a new cluster by calling the run_jobflow() method of a boto.emr.connection.EmrConnection object. It returns the cluster ID that EMR generates for you.

First, the mandatory setup:

    #!/usr/bin/env python

    import boto
    import boto.emr
    from boto.emr.instance_group import InstanceGroup

    conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

    instance_groups = []
    instance_groups.append(InstanceGroup(
        num_instances=1,
        role="MASTER",
        type="m1.small",
        market="ON_DEMAND",
        name="Main node"))
    instance_groups.append(InstanceGroup(
        num_instances=2,
        role="CORE",
        type="m1.small",
        market="ON_DEMAND",
        name="Worker nodes"))
    instance_groups.append(InstanceGroup(
        num_instances=2,
        role="TASK",
        type="m1.small",
        market="SPOT",
        name="My cheap spot nodes",
        bidprice="0.002"))
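If the group sizes or the spot bid change between runs, the definitions above could be generated by a small helper. This is only a sketch: `build_instance_group_specs` and its parameters are hypothetical names, not part of boto. It returns plain dicts whose keys match the `InstanceGroup` keyword arguments used above, so it can be tried without AWS credentials.

```python
def build_instance_group_specs(core_count, task_count,
                               instance_type="m1.small", spot_bid=None):
    """Describe MASTER, CORE and optional TASK groups as keyword dicts.

    Hypothetical helper (not a boto function). Task nodes are requested on
    the spot market only when spot_bid is given; otherwise they stay on demand.
    """
    specs = [
        dict(num_instances=1, role="MASTER", type=instance_type,
             market="ON_DEMAND", name="Main node"),
        dict(num_instances=core_count, role="CORE", type=instance_type,
             market="ON_DEMAND", name="Worker nodes"),
    ]
    if task_count > 0:
        task = dict(num_instances=task_count, role="TASK", type=instance_type,
                    market="ON_DEMAND", name="Task nodes")
        if spot_bid is not None:
            task["market"] = "SPOT"
            # The bid price is passed as a string, as in the answer above.
            task["bidprice"] = str(spot_bid)
        specs.append(task)
    return specs


specs = build_instance_group_specs(core_count=2, task_count=2, spot_bid="0.002")
# With boto installed, each dict can be turned into an InstanceGroup:
#   instance_groups = [InstanceGroup(**spec) for spec in specs]
```

Keeping the specs as plain data makes it easy to inspect or log what will be requested before any EMR call is made.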

Finally we start a new cluster:

    cluster_id = conn.run_jobflow(
        "Name for my cluster",
        instance_groups=instance_groups,
        action_on_failure='TERMINATE_JOB_FLOW',
        keep_alive=True,
        enable_debugging=True,
        log_uri="s3://mybucket/logs/",
        hadoop_version=None,
        ami_version="2.4.9",
        steps=[],
        bootstrap_actions=[],
        ec2_keyname="my-ec2-key",
        visible_to_all_users=True,
        job_flow_role="EMR_EC2_DefaultRole",
        service_role="EMR_DefaultRole")
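The call returns once the request is accepted, while the instances are still being provisioned. A rough sketch of polling until the cluster is idle follows. It assumes a boto 2 release whose `EmrConnection` provides `describe_cluster()` and that the returned object exposes `status.state`; `wait_for_cluster` is a hypothetical helper, and the connection and sleep function are passed in so the loop can be tried against a stub.

```python
import time

# State names as reported by EMR for a cluster.
READY_STATES = ("WAITING",)
FAILED_STATES = ("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")


def wait_for_cluster(conn, cluster_id, poll_seconds=30, sleep=time.sleep):
    """Poll until the cluster is idle (WAITING) and return that state.

    Hypothetical helper. Raises RuntimeError if the cluster shuts down first.
    """
    while True:
        state = conn.describe_cluster(cluster_id).status.state
        if state in READY_STATES:
            return state
        if state in FAILED_STATES:
            raise RuntimeError(
                "Cluster %s ended in state %s" % (cluster_id, state))
        sleep(poll_seconds)
```

With keep_alive=True and an empty steps list, as in the call above, the cluster should settle in WAITING once bootstrapping finishes, so `wait_for_cluster(conn, cluster_id)` would block until then.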

We can also print the cluster ID if we care about that:

    print("Starting cluster %s" % cluster_id)
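The same ID is how an existing cluster is addressed later, which covers the first point of the question. The sketch below assumes boto 2's `add_jobflow_steps(jobflow_id, steps)` and `boto.emr.step.StreamingStep`; the helper name `submit_streaming_step` is hypothetical, and the cluster ID and S3 paths in the example call are placeholders. The choice of `action_on_failure="CONTINUE"` is an assumption made here so that a failed step does not terminate a long-lived cluster.

```python
def submit_streaming_step(conn, cluster_id, name, mapper, reducer,
                          input_uri, output_uri):
    """Queue one Hadoop streaming step on an already running cluster.

    Hypothetical helper. boto is imported lazily so that this module can be
    loaded on machines where boto is not installed.
    """
    from boto.emr.step import StreamingStep

    step = StreamingStep(name=name, mapper=mapper, reducer=reducer,
                         input=input_uri, output=output_uri,
                         action_on_failure="CONTINUE")
    # The cluster is selected purely by its ID (job flow ID).
    return conn.add_jobflow_steps(cluster_id, [step])


# Example call (placeholder ID and S3 paths):
#   submit_streaming_step(conn, "j-XXXXXXXXXXXXX", "Word count",
#                         "s3://mybucket/code/mapper.py", "aggregate",
#                         "s3://mybucket/input/", "s3://mybucket/output/")
```

Because the cluster was started with keep_alive=True, it should remain available for further steps until it is terminated explicitly.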
answered Sep 24 '22 at 00:09 by Vilsepi
