How to turn pip / pypi installed python packages into zip files to be used in AWS Glue

I am working with AWS Glue and PySpark ETL scripts, and I want to use auxiliary libraries such as google-cloud-bigquery as part of my PySpark scripts.

The documentation states this should be possible. A previous Stack Overflow discussion, especially one comment in one of the answers, seems to confirm it. However, how to actually do it is unclear to me.

So the goal is to turn the pip-installed packages into one or more zip files, so that I can simply host the packages on S3 and point to them like so:

s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
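
(For reference, a list like this would be passed to the job through Glue's --extra-py-files special parameter. A minimal sketch of the wiring with the AWS CLI, where the job name, role, bucket names, and script location are all placeholders:)

# Hypothetical job creation; job name, role, buckets, and script path are placeholders.
aws glue create-job \
  --name my-pyspark-job \
  --role MyGlueServiceRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://bucket/prefix/script.py"}' \
  --default-arguments '{"--extra-py-files": "s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip"}'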

How that should be done is not clearly stated anywhere I've looked.

That is: how do I pip install a package and then turn it into a zip file that I can upload to S3, so that PySpark can use it via such an S3 URL?

Using the command pip download I have been able to fetch the libs, but they are not .zip files by default; instead they come as either .whl or .tar.gz files.

...so I'm not sure how to turn them into zip files that AWS Glue can digest. With a .tar.gz I could presumably just tar -xf it and then zip it back up, but what about .whl files?
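
(Note on wheels: a .whl file is itself an ordinary zip archive, so it can be unpacked with standard zip tools. A rough sketch, with the wheel file name as a placeholder:)

# A .whl is a standard zip archive; unpack it into a folder...
unzip some_package-1.0.0-py2.py3-none-any.whl -d deps
# ...then zip the extracted contents back up for S3:
cd deps && zip -r ../deps.zip . && cd ..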

asked Mar 07 '18 by herb
1 Answer

So, after going through the materials I sourced in the comments over the past 48 hours, here's how I solved the issue.

Note: I use Python 2.7 because that's what AWS Glue seems to ship with.

By following the instructions in E. Kampf's blog post "Best Practices Writing Production-Grade PySpark Jobs" and this Stack Overflow answer, plus some tweaking due to random errors along the way, I did the following (the full sequence is collected into a single script after the list):

  1. Create a new project folder called ziplib and cd into it:

mkdir ziplib && cd ziplib

  2. Create a requirements.txt file with the name of each required package on its own row.

  3. Create a folder in it called deps:

mkdir deps

  4. Create a new virtualenv environment with Python 2.7 in the current folder:

virtualenv -p python2.7 .

  5. Install the requirements into the folder deps, using an ABSOLUTE path (otherwise it won't work):

bin/pip2.7 install -r requirements.txt --install-option="--install-lib=/absolute/path/to/.../ziplib/deps"

  6. cd into the deps folder, zip its contents into the archive deps.zip in the parent folder, and then cd back out:

cd deps && zip -r ../deps.zip . && cd ..
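
Putting it all together, here is the whole sequence as one script (the package name written into requirements.txt is just a placeholder):

mkdir ziplib && cd ziplib
echo 'google-cloud-bigquery' > requirements.txt   # placeholder package list
mkdir deps
virtualenv -p python2.7 .
# The install path must be absolute; $PWD expands to one here.
bin/pip2.7 install -r requirements.txt --install-option="--install-lib=$PWD/deps"
cd deps && zip -r ../deps.zip . && cd ..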

And so now I have a zip file which, once put onto AWS S3 and pointed to from PySpark on AWS Glue, seems to work.
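
For completeness, a sketch of the upload and the job argument (bucket and prefix are placeholders):

# Upload the archive; bucket and prefix are placeholders.
aws s3 cp deps.zip s3://bucket/prefix/deps.zip
# Then reference it from the Glue job via the --extra-py-files argument:
#   "--extra-py-files": "s3://bucket/prefix/deps.zip"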

HOWEVER... what I haven't been able to solve is that some packages, such as the Google Cloud Python client libraries, use what are known as implicit namespace packages (PEP 420): they don't have the __init__.py files usually present in packages, and thus the import statements don't work. I'm at a loss here.
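One workaround that is sometimes suggested for this, which I can't confirm actually fixes the Google Cloud imports on Glue, is to drop an empty __init__.py into every directory of the unpacked tree before zipping, so the directories become regular packages:

# Untested workaround: add an empty __init__.py to every directory under deps,
# then rebuild the archive.
find deps -type d -exec sh -c 'touch "$1/__init__.py"' _ {} \;
cd deps && zip -r ../deps.zip . && cd ..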

answered Nov 02 '22 by herb