Using mrjob to run Python code on Amazon's Elastic MapReduce (EMR), I have found a way to upgrade the numpy and scipy versions on the EMR image.
Running the following commands from the console works:
tar -cvf py_bundle.tar mymain.py Utils.py numpy-1.6.1.tar.gz scipy-0.9.0.tar.gz
gzip py_bundle.tar
python my_mapper.py -r emr --python-archive py_bundle.tar.gz --bootstrap-python-package numpy-1.6.1.tar.gz --bootstrap-python-package scipy-0.9.0.tar.gz > output.txt
This bootstraps the latest numpy and scipy into the image and works perfectly. My question is about speed: the installation takes 21 minutes on a small instance.
Does anyone have any idea how to speed up the process of upgrading numpy and scipy?
The only way to do anything to an EMR image is by using bootstrap actions. Doing this from the console means you'll only change the master node and not the task nodes which do the processing. Bootstrap actions run once at startup on all nodes and can be a simple script that gets shell exec'd.
elastic-mapreduce --create --bootstrap-action "s3://bucket/path/to/script" ...
To speed up changes to the EMR image, tar up the post-installed files and upload the archive to S3. Then use a bootstrap action to download and deploy it. You will have to keep separate archives for 32-bit (micro, small, medium) and 64-bit machines.
The command to download from S3 in the script is:
hadoop fs -get s3://bucket/path/to/archive /tmp/archive
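The "tar up the post-installed files" step could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names bundle_installed_packages and deploy_archive are hypothetical, as is the idea that the installed packages live as plain directories under a single site-packages folder. The real layout depends on how numpy and scipy were built on the instance.

```python
import os
import tarfile


def bundle_installed_packages(site_packages, package_names, archive_path):
    """Pack already-installed package directories into a .tar.gz archive.

    site_packages and package_names are assumptions for this sketch:
    point them at wherever the compiled packages ended up on the node
    that ran the slow install.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in package_names:
            src = os.path.join(site_packages, name)
            # Store paths relative to site-packages so the archive can be
            # unpacked into the matching directory on another node.
            tar.add(src, arcname=name)
    return archive_path


def deploy_archive(archive_path, target_dir):
    """Unpack the archive into target_dir (for example, site-packages)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
```

The archive produced by bundle_installed_packages would be built once per architecture and uploaded to S3. A bootstrap action can then fetch it with the hadoop fs -get command above and unpack it, either with tar or with something like deploy_archive.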
The current answer to this question is that NumPy now comes preinstalled on EMR.
If you want to update NumPy to a more recent version than the one available, you can run a script (as a bootstrap action) that does sudo yum -y install numpy. NumPy is then installed in no time.