I want to do something really basic: simply fire up a Spark cluster through the EMR console and run a Spark script that depends on a Python package (for example, Arrow). What is the most straightforward way of doing this?
Bootstrap actions are scripts that run on the cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.
In most Amazon EMR release versions, cluster instances and system applications use different Python versions by default; in Amazon EMR release versions 4.6.0-5.19.0, for example, Python 3.4 is installed on the cluster instances, but Python 2.7 is the system default.
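If your script needs Python 3 rather than the system default, the usual knob is the spark-env configuration classification, which exports PYSPARK_PYTHON cluster-wide. A minimal sketch of that structure, assuming the stock /usr/bin/python3 location (pass it as Configurations when creating the cluster programmatically, or paste the equivalent JSON into the console's software settings box):

# spark-env classification that points PySpark at Python 3 (sketch)
spark_env_python3 = [
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
            }
        ],
    }
]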
In the AWS CDK, the process of provisioning these initial resources is called bootstrapping (a separate concept from EMR bootstrap actions). The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named CDKToolkit. Like any AWS CloudFormation stack, it appears in the AWS CloudFormation console once it has been deployed.
You can use AWS Step Functions to run PySpark applications as EMR steps on an existing EMR cluster. With Step Functions, you can also create the cluster, run multiple EMR steps sequentially or in parallel, and finally auto-terminate the cluster.
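If you go the Step Functions route, the state machine needs a task state that uses the EMR addStep service integration. Here is a rough sketch, assuming an existing cluster, a hypothetical IAM role ARN, and boto3 to register the machine; the names, region, and S3 paths are placeholders:

import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")  # assumption: region

# Single-state machine that submits one EMR step and waits for it to finish (.sync)
definition = {
    "StartAt": "RunSparkStep",
    "States": {
        "RunSparkStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-XXXXXXXXXXXXX",  # hypothetical cluster ID
                "Step": {
                    "Name": "Spark step",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        # hypothetical script location
                        "Args": ["spark-submit", "s3://mybucket/scripts/my_spark_job.py"],
                    },
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="run-pyspark-on-emr",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrRole",  # hypothetical role
)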
The most straightforward way would be to create a bash script containing your installation commands, copy it to S3, and set a bootstrap action from the console to point to your script.
Here's an example I'm using in production:
s3://mybucket/bootstrap/install_python_modules.sh
#!/bin/bash -xe

# Non-standard and non-Amazon Machine Image Python modules:
sudo pip install -U \
  awscli             \
  boto               \
  ciso8601           \
  ujson              \
  workalendar

sudo yum install -y python-psycopg2
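In the console you just point the Bootstrap Actions section at that S3 path when you create the cluster. If you later want to script cluster creation instead, the same bootstrap action can be attached via the API; a rough boto3 sketch, where the region, release label, instance types, and roles are placeholder assumptions:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumption: region

# Attach the S3 script as a bootstrap action at cluster creation time
emr.run_job_flow(
    Name="cluster-with-python-modules",  # hypothetical name
    ReleaseLabel="emr-5.30.0",           # assumption: release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "Install Python modules",
            "ScriptBootstrapAction": {
                "Path": "s3://mybucket/bootstrap/install_python_modules.sh"
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)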
In short, there are two ways to install packages with pip, depending on the EMR release you are on. Either way, you install whatever you need first and then run your Spark step. The easiest is to use emr-4.0.0 and command-runner.jar:
from boto.emr.step import JarStep

# conn is an existing boto EMR connection; emr.jobflowid is the target cluster's job flow ID
pip_step = JarStep(name="Command Runner",
                   jar="command-runner.jar",
                   action_on_failure="CONTINUE",
                   step_args=["sudo", "pip", "install", "arrow"])

spark_step = JarStep(name="Spark with Command Runner",
                     jar="command-runner.jar",
                     step_args=["spark-submit", "/usr/lib/spark/examples/src/main/python/pi.py"],
                     action_on_failure="CONTINUE")

step_list = conn.add_jobflow_steps(emr.jobflowid, [pip_step, spark_step])
On 2.x and 3.x AMI versions, you use script-runner.jar in a similar fashion, except that you have to specify the full S3 URI for script-runner.jar.
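For example, a script-runner step in the same style as above might look like this; the region in the JAR URI and the script path are assumptions you would adjust:

script_step = JarStep(name="Script Runner",
                      # region-specific URI; us-east-1 is an assumption
                      jar="s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                      action_on_failure="CONTINUE",
                      # hypothetical script location in your bucket
                      step_args=["s3://mybucket/bootstrap/install_python_modules.sh"])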
EDIT: Sorry, I didn't see that you wanted to do this through the console. You can add the same steps in the console as well. The first step would be a Custom JAR step with the same arguments as above. The second step is a Spark step. Hope this helps!
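For anyone scripting this today, the legacy boto calls above map onto boto3's add_job_flow_steps; a rough equivalent, with the region and cluster ID as placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumption: region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "Install arrow",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["sudo", "pip", "install", "arrow"],
            },
        },
        {
            "Name": "Spark pi example",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "/usr/lib/spark/examples/src/main/python/pi.py"],
            },
        },
    ],
)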