Every time I create a new project, the package manager copies the entire codebase of every dependency into the project folder, which I find wasteful. For example, I don't want each of my projects to occupy 1 gigabyte worth of TensorFlow v2.8 disk space.
In other languages, this is easy to avoid. For instance, in Node.js we can use pnpm or Yarn Berry, and in Dart this functionality is built in by default. They use a global cache directory, and the runtime refers directly to ~/cache/.some-language/some-package/version/files (or a symlink to it, in pnpm's case).
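For illustration, this is roughly how the sharing can be observed in a pnpm project (the commands below are only for comparison; the exact store location varies by platform):
pnpm store path            # prints the location of pnpm's global content-addressable store
ls -l node_modules | head  # package entries show up as symlinks rather than full copies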
However, I can't seem to find a way to do this in Python. I read about the dozens of Python package managers and tried uv, because its GitHub page advertises:
💾 Disk-space efficient, with a global cache for dependency deduplication.
Unfortunately, it turns out that this was completely false and misleading. I tried it, and it only caches the package to reduce network usage the next time it is installed, but it still copies an entire instance into each project's .venv folder. This does not improve disk space usage at all.
For example:
myproject/.venv/site-packages % du -sh * | sort -hr | head -20
1.0G tensorflow
326M jaxlib
106M mediapipe
106M cv2
81M scipy
72M clang
46M numpy
No-go answers:
It's a valid pain point. I follow this approach.
The fundamental reason is Python's import system and how packages are installed: Python packages often contain compiled extensions and platform-specific code, which makes hard linking or more sophisticated caching harder.
One way to reduce disk usage is to use Docker.
Create a base image:
FROM python:3.11-slim as base
RUN pip install tensorflow numpy scipy
In the project image:
FROM base
COPY requirements.txt .
RUN pip install -r requirements.txt
The Docker approach helps us overcome platform dependencies and also provides layer caching.
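A minimal build sketch, assuming the first snippet is saved as Dockerfile.base and the second as the project's Dockerfile (the file and tag names here are my own):
# Tag the shared image as "base" so the project's FROM line resolves to the locally built image
docker build -t base -f Dockerfile.base .
# Build the per-project image on top of it; only the project-specific layer is added
docker build -t myproject .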
You need to maintain a global packages directory and install the libraries into it:
pip install --target ~/.python-global-packages tensorflow scipy numpy
Whenever I create a venv:
cd project/.venv/lib/python3.11/site-packages
I create symlinks like so:
ln -s ~/.python-global-packages/tensorflow* .
ln -s ~/.python-global-packages/scipy* .
The automation script:
#!/bin/bash
# Link selected packages from the global package directory into this venv.
GLOBAL_PKGS=~/.python-global-packages
# Replace 3.x with the venv's actual Python version (it must match the global install).
SITE_PKGS=.venv/lib/python3.x/site-packages
packages=("tensorflow" "scipy")
for pkg in "${packages[@]}"; do
    ln -s "$GLOBAL_PKGS"/${pkg}* "$SITE_PKGS"/
done
Packages might have interdependencies, so you may need to symlink related packages together. The Python version also needs to match between the global directory and the virtual environment.
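A quick way to sanity-check that the links are in place (paths as in this answer; replace 3.x with your Python version):
# Entries should show up as symlinks pointing at the global directory
ls -l .venv/lib/python3.x/site-packages | grep tensorflow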
We use Poetry:
poetry config cache-dir /path/to/shared/cache
Initialize the project:
poetry init
poetry add tensorflow numpy scipy
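To double-check that the shared cache location is actually being picked up, you can read the setting back:
poetry config cache-dir   # should print /path/to/shared/cache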
After experimenting more, I found that PDM does this, but you need to enable it first.
If a package is required by many projects on the system, each project has to keep its own copy. This can be a waste of disk space, especially for data science and machine learning projects.
PDM supports caching installations of the same wheel by installing it in a centralized package repository and linking to that installation in different projects. To enable it, run:
pdm config install.cache on
In addition, several different link methods are supported:
symlink (default): create symlinks to the package files.
hardlink: create hard links to the package files of the cache entry.
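A minimal sketch of trying this out in a fresh project (the package choice is just an example, and install.cache_method is the config key that selects between the link methods listed above in recent PDM versions):
pdm config install.cache on              # enable the centralized package cache
pdm config install.cache_method symlink  # or hardlink
pdm init                                 # answer the prompts, then add a heavy dependency
pdm add tensorflow
# The installed files should resolve into PDM's central cache rather than being full copies:
pdm run python -c "import tensorflow, os; print(os.path.realpath(tensorflow.__file__))"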