We've built a large Python repo that uses lots of libraries (numpy, scipy, tensorflow, ...) and have managed these dependencies through a conda environment. Basically, we have lots of developers contributing, and any time someone needs a new library for something they're working on, they 'conda install' it.
Fast forward to today: we now need to deploy some applications that use our repo. We are deploying with Docker, but we're finding that the images are really large (10+ GB) and causing issues. However, each individual application only uses a subset of all the dependencies in the environment.yml.
Is there some easy strategy for dealing with this problem? In a sense I need to know the dependencies for each application, but I'm not sure how to do this in an automated way.
Any help here would be great. I'm new to this whole AWS, Docker, and Python deployment thing... We're really a bunch of engineers and scientists who need to scale up our software. We have something that works; it just seems like there has to be a better way.
Most Docker images aren't built from scratch. Instead, you take an existing image and use it as the basis for your image using the FROM command in your Dockerfile. Docker has a series of "official" base images based on various Linux distributions, and also base images that package specific programming languages, in particular Python.
The full Python base image itself is large, but the theory is that the packages it bundles come in via common image layers that other official Docker images share, so overall disk usage stays low. There is also a smaller Debian 11 "slim" variant that leaves most of those extra packages out.
A Dockerfile is a text document that contains the instructions to assemble a Docker image. When we tell Docker to build our image by executing the docker build command, Docker reads these instructions, executes them, and creates a Docker image as a result. Let's walk through the process of creating a Dockerfile for our application.
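As a rough sketch, assuming a hypothetical application whose entry point is app.py and whose pinned dependencies live in a requirements.txt (both names are placeholders, not something from your repo), a minimal Dockerfile on top of the official slim Python image could look like this:

    # Start from an official slim Python base image instead of a full Anaconda image
    FROM python:3.11-slim

    WORKDIR /app

    # Install only this application's pinned dependencies
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the application code last so the dependency layer stays cached between builds
    COPY . .

    CMD ["python", "app.py"]

Building it is then just a matter of running docker build -t myapp . from that application's folder.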
First, see if there are easy wins to shrink the image: use a minimal base such as Alpine Linux, be very careful about what gets installed with the OS package manager (only allow dependencies or "recommended" packages when truly required), and clean up and delete artifacts like package lists and big things you may not need, such as Java.
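For example, on a Debian-based image the OS-level install step can be kept tight; build-essential below is just a placeholder for whatever system packages your libraries actually need:

    # Install only the OS packages you really need, skip "recommended" extras,
    # and remove the apt package lists in the same layer so they don't bloat the image
    RUN apt-get update && \
        apt-get install --yes --no-install-recommends build-essential && \
        rm -rf /var/lib/apt/lists/*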
The base Anaconda/Ubuntu image is ~ 3.5GB in size, so it's not crazy that with a lot of extra installations of heavy third-party packages, you could get up to 10GB. In production image processing applications, I routinely worked with Docker images in the range of 3GB to 6GB, and those sizes were after we had heavily optimized the container.
To your question about splitting dependencies: you should give each application its own package definition, basically a setup.py script and some other details, with its dependencies listed in some mix of requirements.txt for pip and/or environment.yml for conda.
If you have Project A in some folder / repo and Project B in another, you want people to easily be able to do something like pip install <GitHub URL to a version tag of Project A>
or conda env create -f ProjectB_environment.yml
or something, and voila, that application is installed.
Then when you deploy a specific application, have some CI tool like Jenkins build the container for that application, using a FROM line to start from your thin Alpine (or whatever) base image, and only run conda install or pip install against the dependency file for that project, not all the others.
This also has the benefit that multiple different projects can declare different version dependencies even among the same set of libraries. Maybe Project A is ready to upgrade to the latest and greatest pandas version, but Project B needs some refactoring before the team wants to test that upgrade. This way, when CI builds the container for Project B, it will have a Python dependency file with one set of versions, while in Project A's folder or repo of source code, it might have something different.
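A per-project Dockerfile along those lines might look like the sketch below; the miniconda base image, the environment name projB, and the module name projectb are all assumptions for illustration, not something from your repo:

    # Thin base image that ships with conda
    FROM continuumio/miniconda3

    WORKDIR /opt/projectB

    # Install only Project B's declared dependencies
    COPY ProjectB_environment.yml .
    RUN conda env create -f ProjectB_environment.yml && \
        conda clean --all --yes

    # Copy Project B's source code
    COPY . .

    # Run inside the environment created above (assumes the yml names it "projB")
    CMD ["conda", "run", "-n", "projB", "python", "-m", "projectb"]

Jenkins (or whatever CI you use) then just runs docker build in Project B's folder, and Project A's image is built the same way from its own dependency file.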
There are many ways to tackle this problem:
- Lean Docker images: start with a very simple base image and layer your images. See the best practices for building images.
- Specify individual app requirements using requirements.txt files (make sure you pin your versions), and see the specific instructions for conda environment files.
- Build and install "on demand": when you run docker build, install only the requirements for the specific application, not one giant image for every possible eventuality. A multi-stage sketch of this idea follows this list.
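One common way to combine the "lean image" and "on demand" points is a multi-stage build, sketched here with an assumed per-app requirements.txt and entry point app.py:

    # Stage 1: install the app's pinned dependencies into a virtual environment
    FROM python:3.11-slim AS builder
    COPY requirements.txt .
    RUN python -m venv /opt/venv && \
        /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

    # Stage 2: copy only the installed environment into a clean base image,
    # leaving pip caches and any build tooling behind in the builder stage
    FROM python:3.11-slim
    COPY --from=builder /opt/venv /opt/venv
    ENV PATH="/opt/venv/bin:$PATH"

    WORKDIR /app
    COPY . .
    CMD ["python", "app.py"]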