 

Boto3 EMR - Hive step

Is it possible to run Hive steps using Boto3? I have been doing so with the AWS CLI, but from the docs (http://boto3.readthedocs.org/en/latest/reference/services/emr.html#EMR.Client.add_job_flow_steps), it seems that only JARs are accepted. If Hive steps are possible, where are the resources?

Thanks

intl asked Sep 05 '15

People also ask

What is an EMR step?

You can submit one or more ordered steps to an Amazon EMR cluster. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.

What is Hive in EMR?

Apache Hive is an open-source, distributed, fault-tolerant system that provides data warehouse-like query capabilities. It enables users to read, write, and manage petabytes of data using a SQL-like interface.

What is EMR serverless?

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.

How do I find my AWS cluster name?

To view cluster status using the AWS CLI, you can use the describe-cluster command to view cluster-level details including status, hardware and software configuration, VPC settings, bootstrap actions, instance groups, and so on. For more information about cluster states, see Understanding the cluster lifecycle.
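The Boto3 equivalent of describe-cluster is the EMR client's describe_cluster call. A minimal sketch, assuming you just want the cluster name (the wrapper name get_cluster_name is mine; the client is passed in so it can be stubbed out without AWS credentials):

```python
def get_cluster_name(emr_client, cluster_id):
    """Return the Name of an EMR cluster via the DescribeCluster API."""
    resp = emr_client.describe_cluster(ClusterId=cluster_id)
    return resp["Cluster"]["Name"]

# Usage against real AWS (requires credentials):
#   import boto3
#   client = boto3.client("emr", region_name="us-east-1")
#   print(get_cluster_name(client, "j-2GS7xxxxxx"))
```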


2 Answers

I was able to get this to work using Boto3:

import boto3

# Build the Hive command line and split it into an argument list
hive_args = "hive -v -f s3://user/hadoop/hive.hql"
hive_args_list = hive_args.split()

# Define the Hive step; command-runner.jar executes the command on the cluster
hive_emr_step = [
    {
        'Name': 'Hive_EMR_Step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': hive_args_list
        }
    },
]

# Create a Boto3 session and EMR client
session = boto3.Session(region_name=AWS_REGION, profile_name=AWS_PROFILE)
client = session.client('emr')

# Submit the step, where cluster_id is the ID of your cluster
# from AWS EMR (ex: j-2GS7xxxxxx)
client.add_job_flow_steps(JobFlowId=cluster_id, Steps=hive_emr_step)
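If you also want to block until the step finishes, the Boto3 EMR client exposes a step_complete waiter. A sketch along those lines (the wrapper function submit_and_wait is my own naming; the client is injected so the logic can be tested without AWS):

```python
def submit_and_wait(emr_client, cluster_id, steps):
    """Submit EMR steps and block until the first one completes."""
    resp = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    step_id = resp["StepIds"][0]
    # 'step_complete' polls DescribeStep until the step finishes or fails
    waiter = emr_client.get_waiter("step_complete")
    waiter.wait(
        ClusterId=cluster_id,
        StepId=step_id,
        WaiterConfig={"Delay": 30, "MaxAttempts": 120},
    )
    return step_id
```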
wt_tsu_13 answered Sep 28 '22


In the previous version of Boto, there was a helper class named HiveStep which made it easy to construct a job flow step for executing a Hive job. In Boto3, however, the approach has changed: the classes are generated at run time from the AWS REST API, so no such helper class exists. Looking at the source code of HiveStep, https://github.com/boto/boto/blob/2d7796a625f9596cbadb7d00c0198e5ed84631ed/boto/emr/step.py, you can see that it is a subclass of Step, a class with properties jar, args, and mainclass, very similar to the requirements in Boto3.

It turns out that all job flow steps on EMR, including Hive ones, are still instantiated from a JAR. You can therefore execute Hive steps through Boto3, but there is no helper class to make it easy to construct the definition.

By looking at the approach used by HiveStep in the previous version of Boto, you could construct a valid job flow definition.
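For example, a small helper in the spirit of the old HiveStep could look like this (the helper name make_hive_step is mine; it mirrors the command-runner.jar approach shown in the other answer):

```python
def make_hive_step(name, script_s3_path, action_on_failure="CONTINUE"):
    """Build a job flow step dict that runs a Hive script on the cluster.

    Returns a dict in the shape expected by add_job_flow_steps(Steps=[...]).
    """
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive", "-v", "-f", script_s3_path],
        },
    }

# Usage (requires a real cluster and credentials):
#   client.add_job_flow_steps(
#       JobFlowId="j-2GS7xxxxxx",
#       Steps=[make_hive_step("Hive_EMR_Step", "s3://user/hadoop/hive.hql")],
#   )
```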

Or, you could fall back to using the previous version of Boto.

mattinbits answered Sep 28 '22