
Is there a way to use symlink in site-packages to reduce disk space usage?

Tags:

python

Every time I create a new project, it copies the entire package codebase into the project folder, which I find wasteful. For example, I don't want each of my projects to occupy 1 gigabyte worth of Tensorflow v2.8 disk space.

In other languages, we can easily avoid this. For instance, in Node.js, we can use pnpm or yarn berry. In Dart, this functionality is built-in by default. They utilize a global cache directory and the engine refers directly to ~/cache/.some-language/some-package/version/files (or its symlink in pnpm).

However, I can't seem to find a way to do this in Python. I read about the dozens of Python package managers and tried uv, because its GitHub page advertises:

💾 Disk-space efficient, with a global cache for dependency deduplication.

Unfortunately, it turns out that this was completely false and misleading. I tried it, and it only caches the package to reduce network usage the next time it is installed, but it still copies an entire instance into each project's .venv folder. This does not improve disk space usage at all.

For example:

myproject/.venv/site-packages % du -sh * | sort -hr | head -20

1.0G    tensorflow
326M    jaxlib
106M    mediapipe
106M    cv2
81M     scipy
72M     clang
46M     numpy
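Per-directory `du` totals like the ones above cannot by themselves show whether the files are independent copies or hard links to data stored elsewhere, since a hard-linked file is reported at full size in every directory it appears in. A minimal sketch for checking this, assuming a POSIX-style filesystem; the function name `disk_usage` is made up for this example, and copy-on-write clones are not detected by this check:

```python
import os
import stat


def disk_usage(root):
    """Return (total_bytes, linked_bytes) for regular files under root.

    total_bytes counts each inode once.
    linked_bytes is the part of total_bytes whose hard link count is
    greater than 1, i.e. data that is also reachable under another name.
    """
    total = 0
    linked = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, etc.
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # same data already counted under another name
            seen.add(key)
            total += st.st_size
            if st.st_nlink > 1:
                linked += st.st_size
    return total, linked
```

If `linked_bytes` is close to `total_bytes` for a `site-packages` directory, most of its files share their data with another location rather than being separate copies.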

no-go answers:

  • using a global interpreter and installing packages on that
  • reusing environments (it slows down indexing of projects that don't require heavy packages like Tensorflow)
TSR asked Oct 24 '25 06:10


2 Answers

It's a valid pain. Here is the approach I follow.


The fundamental reason is Python's import system and how packages are installed. Python packages often contain compiled extensions and platform-specific code, making hard linking or sophisticated caching more challenging.

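One way to reduce disk usage is to use Docker.

Create a base image and tag it as base (for example, docker build -t base .):

FROM python:3.11-slim
RUN pip install tensorflow numpy scipy

In the project image:

FROM base
COPY requirements.txt .
RUN pip install -r requirements.txt

The Docker approach helps us overcome platform dependencies and also provides layer caching: every project image built on the base image reuses its layers, so the heavy packages are stored only once.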

Using symlinks

You need to maintain a global package directory and install the libraries into it.

pip install --target ~/.python-global-packages tensorflow scipy numpy

Whenever I create a venv, I go to its site-packages directory:

cd project/.venv/lib/python3.11/site-packages

and create symlinks like so:

ln -s ~/.python-global-packages/tensorflow* .
ln -s ~/.python-global-packages/scipy* .

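The automation script:

#!/bin/bash
# Symlink selected packages from the global directory into the venv.
GLOBAL_PKGS=~/.python-global-packages
# Replace python3.x with the venv's Python version, e.g. python3.11.
SITE_PKGS=.venv/lib/python3.x/site-packages

packages=("tensorflow" "scipy")

for pkg in "${packages[@]}"; do
    # The glob also matches the package's metadata directory (e.g. *.dist-info).
    ln -s "$GLOBAL_PKGS"/${pkg}* "$SITE_PKGS"/
done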
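The same idea as the shell script above can be sketched in Python, which avoids hard-coding the interpreter version in the path. This is only an illustration: `link_packages` is a hypothetical helper, and the prefix matching mirrors the `${pkg}*` glob, so it can also pick up unrelated packages that share a prefix.

```python
from pathlib import Path


def link_packages(global_dir, site_packages, names):
    """Symlink entries of global_dir whose names start with one of `names`
    into site_packages. Returns the list of links that were created."""
    global_dir = Path(global_dir).expanduser().resolve()
    site_packages = Path(site_packages)
    created = []
    for name in names:
        for src in sorted(global_dir.glob(f"{name}*")):
            dest = site_packages / src.name
            if dest.exists() or dest.is_symlink():
                continue  # never overwrite something already present
            dest.symlink_to(src, target_is_directory=src.is_dir())
            created.append(dest)
    return created


# Example call, run with the venv's own interpreter so that sysconfig
# reports that venv's site-packages directory:
#
#   import sysconfig
#   link_packages("~/.python-global-packages",
#                 sysconfig.get_paths()["purelib"],
#                 ["tensorflow", "scipy"])
```

Skipping existing entries makes the helper safe to re-run, at the cost of not replacing a stale link.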

It has some disadvantages:

Packages might have interdependencies, so you may need to symlink related packages together. Python versions need to match between the global directory and the virtual environments.

This is what we do in production.

We use Poetry with a shared cache directory:

poetry config cache-dir /path/to/shared/cache

Initialize the project:

poetry init
poetry add tensorflow numpy scipy

Answer to the comment: what could be the main disadvantage?

  1. When using symlinks, switching between projects could break dependencies if they share common packages but need different versions. Package upgrades in one project could unintentionally affect other projects.
  2. Some packages modify their contents at runtime. Some packages check their own integrity or location. Compiled extensions (.so/.pyd files) might have hard-coded paths.
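Because an upgrade in the shared directory silently affects every project linked to it, it can help to see which entries of a given site-packages are symlinks and whether their targets still exist. A rough sketch, with `audit_links` being a name invented for this example:

```python
import os
from pathlib import Path


def audit_links(site_packages):
    """Return {entry name: (link target, is_broken)} for every symlink
    found directly inside site_packages."""
    report = {}
    for entry in sorted(Path(site_packages).iterdir()):
        if entry.is_symlink():
            target = os.readlink(entry)
            # exists() follows the link, so it is False for a dangling one.
            report[entry.name] = (target, not entry.exists())
    return report
```

Entries reported as broken point at something that was removed or renamed in the shared directory.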
Bhargav answered Oct 26 '25 20:10


After experimenting more, I found that PDM does it, but you need to activate it. From its documentation:

If a package is required by many projects on the system, each project has to keep its own copy. This can be a waste of disk space, especially for data science and machine learning projects.

PDM supports caching installations of the same wheel by installing it in a centralized package repository and linking to that installation in different projects. To enable it, run:

pdm config install.cache on

In addition, several different link methods are supported:

  • symlink (default): create symlinks to the package files.

  • hardlink: create hard links to the package files of the cache entry.
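The practical difference between the two link methods comes from general filesystem behaviour, which can be sketched as follows. This is not PDM's implementation; `link_demo` and the file names are made up for the illustration, and it assumes a filesystem that supports both link types.

```python
import os
from pathlib import Path


def link_demo(workdir):
    """Create a 'cached' file plus a symlink and a hard link to it,
    then delete the cached file and report what still works."""
    work = Path(workdir)
    cached = work / "cache" / "module.py"
    cached.parent.mkdir(parents=True)
    cached.write_text("VALUE = 1\n")

    project = work / "project"
    project.mkdir()
    sym = project / "module_symlink.py"
    hard = project / "module_hardlink.py"
    sym.symlink_to(cached)   # a small file that stores the path to `cached`
    os.link(cached, hard)    # a second name for the same inode

    result = {
        "same_inode": os.stat(hard).st_ino == os.stat(cached).st_ino,
        "link_count": os.stat(cached).st_nlink,
    }
    cached.unlink()  # simulate the cache entry being removed
    result["symlink_usable"] = sym.exists()
    result["hardlink_usable"] = hard.read_text() == "VALUE = 1\n"
    return result
```

In this sketch the symlink dangles once the cache entry is deleted, while the hard link keeps the data alive. Hard links, however, only work within a single filesystem.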

TSR answered Oct 26 '25 18:10