Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

boto EMR add step and auto terminate

Python 2.7.12

boto3==1.3.1

How can I add a step to a running EMR cluster and have the cluster terminated after the step is complete, regardless of it fails or succeeds?

Create the cluster

response = client.run_job_flow(
    Name=name,
    LogUri='s3://mybucket/emr/',
    ReleaseLabel='emr-5.9.0',
    Instances={
        'MasterInstanceType': instance_type,
        'SlaveInstanceType': instance_type,
        'InstanceCount': instance_count,
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'KeyPair',
        'EmrManagedSlaveSecurityGroup': 'sg-1234',
        'EmrManagedMasterSecurityGroup': 'sg-1234',
        'Ec2SubnetId': 'subnet-1q234',
    },
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hadoop'}
    ],
    BootstrapActions=[
        {
            'Name': 'Install Python packages',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/code/spark/bootstrap_spark_cluster.sh'
            }
        }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    Configurations=[
        {
            'Classification': 'spark',
            'Properties': {
                'maximizeResourceAllocation': 'true'
            }
        },
    ],
)

Add a step

response = client.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            'Name': 'Run Step',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Args': [
                    'spark-submit',
                    '--deploy-mode', 'cluster',
                    '--py-files',
                    's3://mybucket/code/spark/spark_udfs.py',
                    's3://mybucket/code/spark/{}'.format(spark_script),
                    '--some-arg'
                ],
                'Jar': 'command-runner.jar'
            }
        }
    ]
)

This successfully adds a step and runs, however, when the step completes successfully, I would like the cluster to auto-terminate as noted in the AWS CLI: http://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html

like image 720
duffn Avatar asked Oct 24 '17 12:10

duffn


1 Answers

In your case (creating the cluster using boto3) you can add these flags 'TerminationProtected': False, 'AutoTerminate': True, to your cluster creation. In this way after your step finished to run the cluster will be shut-down.

Another solution is to add another step to kill the cluster immediately after the step that you want to run. So basically you need to run this command as step

aws emr terminate-clusters --cluster-ids your_cluster_id

The tricky part is to retrive the cluster_id. Here you can find some solution: Does an EMR master node know it's cluster id?

like image 196
nicor88 Avatar answered Sep 18 '22 21:09

nicor88