Python client support for running Hive on top of Amazon EMR

I've noticed that neither mrjob nor boto offers a Python interface for submitting and running Hive jobs on Amazon Elastic MapReduce (EMR). Are there any other Python client libraries that support running Hive on EMR?

Asked May 23 '11 by poiuy

People also ask

Can we use hive in AWS?

Apache Hive is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Hive clusters from the AWS Management Console, AWS CLI, or the Amazon EMR API.
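As an illustration of the "Amazon EMR API" route, the same thing can be done from Python with boto3 (boto's successor). A minimal sketch, where the release label, bucket, instance types, and role names are placeholder assumptions to adapt to your account:

import boto3

emr = boto3.client('emr')

# Create a managed cluster with Hive installed (names/values are placeholders)
response = emr.run_job_flow(
    Name='hive-cluster',
    ReleaseLabel='emr-6.9.0',              # assumption: any recent release label
    Applications=[{'Name': 'Hive'}],
    LogUri='s3://my-bucket/emr-logs/',
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
cluster_id = response['JobFlowId']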

Can we run PySpark on EMR?

You can use AWS Step Functions to run PySpark applications as EMR Steps on an existing EMR cluster. Using Step Functions, we can also create the cluster, run multiple EMR Steps sequentially or in parallel, and finally, auto-terminate the cluster.
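Setting Step Functions aside, the "EMR Steps" part alone looks roughly like the sketch below: submitting a PySpark script to an existing cluster with boto3's add_job_flow_steps. The cluster id and S3 script path are placeholders:

import boto3

emr = boto3.client('emr')

# Submit a spark-submit step to an already-running cluster
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',           # placeholder: your cluster id
    Steps=[{
        'Name': 'PySpark step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     's3://my-bucket/scripts/job.py'],
        },
    }],
)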

Which options are commonly used processing frameworks for Amazon EMR?

Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto.


1 Answer

With boto you can do something like this:

from boto.emr.connection import EmrConnection
from boto.emr.step import JarStep

# First step installs Hive on the cluster; second step runs the query script.
args1 = [u's3://us-east-1.elasticmapreduce/libs/hive/hive-script',
         u'--base-path',
         u's3://us-east-1.elasticmapreduce/libs/hive/',
         u'--install-hive',
         u'--hive-versions',
         u'0.7']
args2 = [u's3://us-east-1.elasticmapreduce/libs/hive/hive-script',
         u'--base-path',
         u's3://us-east-1.elasticmapreduce/libs/hive/',
         u'--hive-versions',
         u'0.7',
         u'--run-hive-script',
         u'--args',
         u'-f',
         s3_query_file_uri]  # S3 URI of the Hive query file to run

# Wrap each argument list in a script-runner step.
steps = []
for step_name, args in zip(('Setup Hive', 'Run Hive Script'), (args1, args2)):
    step = JarStep(step_name,
                   's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
                   step_args=args,
                   # action_on_failure="CANCEL_AND_WAIT" keeps the cluster up on failure
                   )
    steps.append(step)

# Kick off the job flow (s3_log_uri and the instance settings are yours to supply)
jobid = EmrConnection().run_jobflow('Hive job flow', s3_log_uri,
                                    steps=steps,
                                    master_instance_type=master_instance_type,
                                    slave_instance_type=slave_instance_type,
                                    num_instances=num_instances,
                                    hadoop_version="0.20")
Answered Nov 15 '22 by unthingable