How to bootstrap installation of Python modules on Amazon EMR?

I want to do something really basic, simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?

asked Jul 20 '15 by Evan Zamir

People also ask

What is bootstrapping in EMR?

Bootstrap actions are scripts that run on the cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). They run before Amazon EMR installs the applications that you specify when you create the cluster, and before the cluster nodes begin processing data.

Can we run Python on EMR?

In most Amazon EMR release versions, cluster instances and system applications use different Python versions by default. For example, in Amazon EMR release versions 4.6.0-5.19.0, Python 3.4 is installed on the cluster instances, while Python 2.7 is the system default.

Does AWS use bootstrap?

The process of provisioning these initial resources is called bootstrapping. The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named CDKToolkit. Like any AWS CloudFormation stack, it appears in the AWS CloudFormation console once it has been deployed.
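For example, with the AWS CDK CLI you bootstrap an account/region pair like this (the account ID and region here are placeholders):

cdk bootstrap aws://123456789012/us-east-1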

Can we run PySpark on EMR?

You can use AWS Step Functions to run PySpark applications as EMR Steps on an existing EMR cluster. Using Step Functions, we can also create the cluster, run multiple EMR Steps sequentially or in parallel, and finally, auto-terminate the cluster.


2 Answers

The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.

Here's an example I'm using in production:

s3://mybucket/bootstrap/install_python_modules.sh

#!/bin/bash -xe

# Non-standard and non-Amazon Machine Image Python modules:
sudo pip install -U \
  awscli            \
  boto              \
  ciso8601          \
  ujson             \
  workalendar

sudo yum install -y python-psycopg2
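If you prefer the AWS CLI to the console, a minimal sketch of launching a cluster with that bootstrap action might look like this (the bucket name, instance type, and instance count are assumptions to adapt):

aws emr create-cluster --name "Spark cluster" \
    --release-label emr-4.0.0 \
    --applications Name=Spark \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://mybucket/bootstrap/install_python_modules.sh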
answered Sep 22 '22 by noli


In short, there are two ways to install packages with pip, depending on the platform: first you install whatever you need, and then you can run your Spark step. The easiest is to use emr-4.0.0 and command-runner.jar:

>>> from boto.emr.step import JarStep
>>> pip_step = JarStep(name="Command Runner",
...                    jar="command-runner.jar",
...                    action_on_failure="CONTINUE",
...                    step_args=['sudo', 'pip', 'install', 'arrow'])
>>> spark_step = JarStep(name="Spark with Command Runner",
...                      jar="command-runner.jar",
...                      step_args=["spark-submit", "/usr/lib/spark/examples/src/main/python/pi.py"],
...                      action_on_failure="CONTINUE")
>>> # conn is an existing boto EMR connection; emr.jobflowid identifies the cluster
>>> step_list = conn.add_jobflow_steps(emr.jobflowid, [pip_step, spark_step])

On AMI versions 2.x and 3.x, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.

EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps in the console as well: the first step would be a Custom JAR with the same args as above, and the second is a Spark step. Hope this helps!
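For completeness, the same two steps can also be added from the AWS CLI; a hedged sketch, where the cluster ID and step names are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
    'Type=CUSTOM_JAR,Name=InstallArrow,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[sudo,pip,install,arrow]' \
    'Type=CUSTOM_JAR,Name=SparkPi,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[spark-submit,/usr/lib/spark/examples/src/main/python/pi.py]'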

answered Sep 19 '22 by Craig F