Shrinking AWS Lambda deployment package with CFLAGS and PIP to fit sklearn

I'm loading a pickled machine learning model in my Lambda handler, so I need sklearn (I get "ModuleNotFoundError: No module named 'sklearn'" if it isn't included).

So I created a new deployment package in Docker with sklearn.

But when I tried to upload the new lambda.zip file, I could not save the Lambda function. I get the error: "Unzipped size must be smaller than 262144000 bytes".

I did some googling and found two suggestions: (1) using CFLAGS with pip and (2) using Lambda Layers.

I don't think Layers will work. Moving parts of my deployment package into layers won't reduce the total size (the AWS documentation states "The total unzipped size of the function and all layers can't exceed the unzipped deployment package size limit of 250 MB").

CFLAGS sounds promising, but I've never worked with CFLAGS before and I'm getting errors.

I'm trying to add the flags: -Os -g0 -Wl,--strip-all

Pre-CFLAGS, my Docker pip command was: pip3 install requests pandas s3fs datetime bs4 sklearn -t ./

First I tried: pip3 install requests pandas s3fs datetime bs4 sklearn -t -Os -g0 -Wl,--strip-all ./

That produced errors of the variety "no such option: -g"

Then I tried: CFLAGS = -Os -g0 -Wl,--strip-all pip3 install requests pandas s3fs datetime bs4 sklearn -t ./ and, on its own, CFLAGS = -Os -g0 -Wl,--strip-all

But they produced the error "CFLAGS: command not found"
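
From more googling, I gather the spaces around = might be the problem (the shell reads CFLAGS as a command name when there's a space). If so, I'd guess the intended form is something like this, though I haven't confirmed it actually shrinks anything:

  # set CFLAGS only for this one pip invocation (no spaces around =)
  # note: CFLAGS only affect packages that pip compiles from source;
  # prebuilt wheels are downloaded as-is and ignore these flags
  CFLAGS="-Os -g0 -Wl,--strip-all" pip3 install requests pandas s3fs datetime bs4 sklearn -t ./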

Can anyone help me understand how to use CFLAGS?

Also, I'm familiar with the saying "beggars can't be choosers" so any advice would be appreciated.

That said, I'm a bit of a noob so if you could help me with CFLAGS in the context of my Docker deployment package workflow it'd be most appreciated.

My Docker workflow is:

  1. docker run -it olivierhervieu/amazonlinux-python36-onbuild
  2. mkdir deploy
  3. cd deploy
  4. pip3 install requests pandas s3fs datetime bs4 sklearn -t ./
  5. zip -r lambda.zip *
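
In case it helps, the unzipped size can be checked locally before uploading (assuming GNU du, which the Amazon Linux image provides), run from inside the deploy directory:

  du -sb .            # total unzipped size in bytes (limit: 262144000)
  zip -r lambda.zip *
  ls -l lambda.zip    # zipped size (the 50 MB direct-upload limit applies here)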
asked Mar 06 '20 by tjc4


3 Answers

This kinda is an answer (I was able to shrink my deployment package and get my Lambda deployed) and kinda not an answer (I still don't know how to use CFLAGS).

A lot of googling eventually led me to this article which included a link to this list of modules that come pre-installed in the AWS Lambda Python environment.

My deployment package contained several of the modules that already exist in the AWS Lambda environment and thus do not need to be included in deployment packages.

The modules that saved the most space for me were boto3 and botocore. I didn't explicitly add these in my Docker environment, but they made their way into my deployment package anyway (I'm guessing s3fs depends on them, so installing s3fs pulled them in too).

I was also able to remove a lot of smaller modules (datetime, dateutil, docutils, six, etc.). With those modules removed my package was under the 250 MB limit and I was able to deploy.
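
For reference, a sketch of the kind of cleanup I mean, run inside the deploy directory before zipping (the directory names are examples from my package and may differ in yours):

  # remove modules the Lambda Python runtime already provides
  rm -rf boto3* botocore* dateutil* docutils* six*
  zip -r lambda.zip *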

Had I still been over the limit (I wasn't sure removing modules would be enough), I was going to try another suggestion from the linked article above: removing .py files from the deployment package (you don't need both .pyc and .py files).
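
I never actually ran this, but a sketch of how that might look (python3 -m compileall -b writes importable .pyc files next to the sources, so the .py files can then be deleted):

  # compile everything to legacy-layout .pyc files beside the sources
  python3 -m compileall -b .
  # drop the .py sources and any leftover __pycache__ directories
  find . -name '*.py' -delete
  find . -type d -name '__pycache__' -exec rm -rf {} +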

Hope this helps with your Lambda deployment package size!

answered Sep 26 '22 by tjc4


These days you would use a Docker container image for your Lambda, as its size can be up to 10 GB, which is far greater than what traditional Lambda functions deployed using deployment packages and layers allow. From AWS:

You can now package and deploy AWS Lambda functions as a container image of up to 10 GB.

Thus you could create a Lambda container with sklearn plus any other files and dependencies you require, up to a total size of 10 GB.
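
A minimal sketch of such an image, assuming AWS's public Python base image and a handler.py exposing a lambda_handler function (file and function names here are placeholders):

  FROM public.ecr.aws/lambda/python:3.8
  COPY requirements.txt .
  RUN pip3 install -r requirements.txt
  COPY handler.py ${LAMBDA_TASK_ROOT}
  CMD ["handler.lambda_handler"]

You then build the image, push it to ECR, and point the Lambda function at it.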

answered Sep 22 '22 by Marcin


We faced this exact problem ourselves, but with spaCy rather than sklearn.

You're going about it the right way by not deploying packages already included in the AWS runtime, but note that sometimes this still won't get you under the limit (especially for ML use cases, where large models have to be included as part of the dependency).

In these instances, another option is to store any external static files the library uses (e.g. models) in a private S3 bucket and read them in at runtime, as described by this answer.
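
A minimal sketch of that pattern, assuming a hypothetical MODEL_BUCKET environment variable and a model.pkl object key (boto3 is already present in the Lambda runtime):

  import os
  import pickle

  import boto3

  s3 = boto3.client("s3")

  def load_model():
      # /tmp is the only writable path in the Lambda environment
      local_path = "/tmp/model.pkl"
      s3.download_file(os.environ["MODEL_BUCKET"], "model.pkl", local_path)
      with open(local_path, "rb") as f:
          return pickle.load(f)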

Incidentally, if you're using the Serverless Framework to deploy your Lambdas, you should check out the serverless-python-requirements plugin, which lets you implement the steps you've described, such as specifying packages not to deploy with the function and building 'slim' versions of the dependencies (automatically stripping out the .so files, __pycache__ and dist-info directories, as well as .pyc and .pyo files).
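
A sketch of the relevant serverless.yml bits (slim and noDeploy are options documented in the plugin's README; the noDeploy list here is just an example):

  plugins:
    - serverless-python-requirements
  custom:
    pythonRequirements:
      slim: true
      noDeploy:
        - boto3
        - botocore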

Good luck :)

answered Sep 25 '22 by mdmjsh