Python dependency hell: A compromise between virtualenv and global dependencies?

I've tested various ways to manage my project dependencies in Python so far:

  1. Installing everything globally with pip (saves space, but sooner or later gets you in trouble)
  2. pip & venv or virtualenv (a bit of a pain to manage, but ok for many cases)
  3. pipenv & pipfile (a little bit easier than venv/virtualenv, but slow and some vendor-lock, virtual envs hide somewhere else than the actual project folder)
  4. conda as package and environment manager (great as long as the packages are all available in conda, mixing pip & conda is a bit hacky)
  5. Poetry - I haven't tried this one
  6. ...

My problem with all of these (except 1.) is that my hard drive space fills up pretty fast: I am not a developer, I use Python for my daily work. Therefore, I have hundreds of small projects that all do their thing. Unfortunately, for 80% of projects I need the "big" packages: numpy, pandas, scipy, matplotlib - you name it. A typical small project is about 1000 to 2000 lines of code, but has 800MB of package dependencies in venv/virtualenv/pipenv. Virtually I have about 100+ GB of my HDD filled with python virtual dependencies.
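To put numbers on the space problem, a rough sketch like the following could be used. It is only an illustration: it assumes each environment folder contains a pyvenv.cfg marker file (venv and recent virtualenv releases write one; conda environments do not), and the function names are made up for this example.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def find_venvs(projects_root):
    """Return folders under projects_root that contain a pyvenv.cfg marker file."""
    return sorted(cfg.parent for cfg in Path(projects_root).rglob("pyvenv.cfg"))


def report(projects_root):
    """Map each detected environment folder to its size in megabytes."""
    return {str(env): dir_size_bytes(env) / 1e6 for env in find_venvs(projects_root)}
```

Calling report() on the folder that holds all projects returns one entry per detected environment, which makes it easy to see which projects carry the heaviest dependencies.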

Moreover, installing all of these in each virtual environment takes time. I am working on Windows, where many packages cannot be easily installed from pip: Shapely, Fiona, GDAL - I need the precompiled wheels from Christoph Gohlke. This is easy, but it breaks most workflows (e.g. pip install -r requirements.txt or pipenv install from pipfile). I feel like I spend 40% of my time installing/updating package dependencies and only 60% writing code. Further, none of these package managers really help with publishing & testing code, so I need other tools e.g. setuptools, tox, semantic-release, twine...

I've talked to colleagues but they all face the same problem and no one seems to have a real solution. I was wondering if there is an approach to have some packages, e.g. the ones you use in most projects, installed globally - for example, numpy, pandas, scipy, matplotlib would be installed with pip in C:\Python36\Lib\site-packages or with conda in C:\ProgramData\Miniconda3\Lib\site-packages - these are well developed packages that don't often break things. And if they do, I would want to fix that in my projects soon anyway.

Other things would go in local virtualenv-folders - I am tempted to move my current workflow from pipenv to conda.

Does such an approach make sense at all? There has been a lot of development lately in Python, so perhaps something has emerged that I haven't seen yet. Is there any best-practice guidance on how to set up files in such a mixed global-local environment, e.g. how to maintain setup.py, requirements.txt or pyproject.toml for sharing development projects through Gitlab, Github etc.? What are the pitfalls/caveats?

There's also this great blog post from Chris Warrick that explains it pretty much fully.

[Update 2021]

Since this post still gets many views, here is a subjective 2021 update:

  • if you are in data science, (mini)conda is still worth a look
  • otherwise, Poetry and the pyproject.toml seem to be the commonly agreed-upon denominator

[Update 2020]

After half a year, I can say that working with Conda (Miniconda) has solved most of my problems:

  • it runs on every system, WSL, Windows, native Linux etc. conda env create -f myenv.yml is the same on every platform
  • most packages are already available on conda-forge, it is easy to get own packages accepted on conda-forge
  • for those packages not on conda, I can install pip in conda environment and add packages from pypi with pip. Hint: conda update --all -n myenv -c conda-forge will only update packages from conda, not those installed with pip. Pip installed dependencies must be updated manually with pip install pack_name --upgrade. Note that installing packages with pip in conda is an emergency solution that should typically be avoided
  • I can create strict or open environment.yml, specifying the conda channel priority, the packages from conda and the packages from pip
  • I can create conda environments from those ymls in a single statement, e.g. to set up a dev environment in Gitlab Continuous Integration, using the Miniconda3 Docker - this makes test-runs very simple and straightforward
  • package versions in ymls can be defined strict or open, depending on the situation. E.g. you can fix the env to Python 3.6, but have it retrieve any security updates in this version-range (e.g. 3.6.9)
  • I found that conda solves almost all problems with c-compiled dependencies in Windows; conda envs in Windows do allow freezing python code into an executable (tested!) that can be distributed to Windows end-users who cannot use package managers for some reason.
  • regarding the issue with "big dependencies": I ended up creating many specific (i.e. small) and a few unspecific (i.e. big) conda environments. For example, I have a quite big jupyter_env, where jupyter lab and most of my scientific packages are installed (numpy, geos, pandas, scipy etc.) - I activate it whenever I need access to these tools, and I can keep those up to date in a single place. For development of specific packages, I have extra environments that are only used for the package-dependencies (e.g. package1_env). I have about 10 environments overall, which is manageable. Some general purpose tools are installed in the base conda environment, e.g. pylint. Be warned: to make pylint/pycodestyle/autopep8 etc. work (e.g.) in VS Code, they must be installed to the same env that contains the python-code-dependencies - otherwise, you'll get unresolved import warnings
  • I installed miniconda with Chocolatey package manager for windows. I keep it up to date with conda update -n base conda, and my envs with conda update --all -n myenv -c conda-forge once a week, works like a charm!
  • New Update: there's a --stack flag available (as of 2019-02-07) that allows stacking conda environments, e.g. conda activate my_big_env then conda activate --stack dev_tools_env allows making some general purpose packages available in many envs. However, use with caution - I found that code linters, such as pylint, must be in the same env as the dependencies of the code that is linted
  • New Update 2: I started using conda from Windows Subsystem for Linux (WSL), which again improved my workflow significantly: packages are installed faster, I can work with VS Code Insiders in Windows directly connected to WSL, and there are far fewer bugs with python packages in the Linux environment.
  • Another Update on a side note, the Miniconda Docker allows converting local conda env workflows flawlessly into containerized infrastructure (CI & CD), tested this for a while now and pretty happy with it - the Dockerfile is cleaner than with Python Docker because conda manages more of the dependency work than pip does. I use this nowadays more and more, for example, when working with jupyter lab, which is started from within a container.
  • yes, I stumbled into compatibility problems between certain packages in a conda env, but very rarely. There are two approaches: if it is an important env that must work reliably, enable conda config --env --set channel_priority strict - this will only install versions that are compatible. With very few and rare package combinations, this may result in unsolvable dependency conflicts (i.e. the env cannot be created). In this case, I usually create smaller envs for experimental development, with fewer packages and channel_priority set to flexible (the default). Sometimes, package subsets exist that are easier to solve, such as geoviews-core (instead of geoviews) or matplotlib-base (instead of matplotlib). It's also a good approach to try lower python versions for those experimental envs that are unsolvable with strict, e.g. conda create -n jupyter_exp_env python=3.6 -c conda-forge. A last-resort hack is installing packages with pip, which avoids conda's package resolver (but may result in unstable environments and other issues, you've been warned!). Make sure to explicitly install pip in your env first.
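As an illustration of the strict/open environment files mentioned in the list above, here is a minimal sketch that renders such a file from Python. The helper name render_environment_yml and the example package list are assumptions made for this example; pack_name stands in for a pip-only package as in the hint above. Channels listed first take priority, and a spec like python=3.6 leaves the patch version open.

```python
def render_environment_yml(name, channels, conda_deps, pip_deps=()):
    """Return the text of a conda environment file.

    conda_deps are conda specs: "python=3.6" accepts any 3.6.x release,
    while a bare "numpy" leaves the version open.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in conda_deps]
    if pip_deps:
        # pip itself should be a conda dependency before the nested pip section.
        if "pip" not in conda_deps:
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_deps]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_environment_yml(
        name="myenv",
        channels=["conda-forge", "defaults"],
        conda_deps=["python=3.6", "numpy", "pandas"],
        pip_deps=["pack_name"],
    ))
```

The printed text can be saved as myenv.yml and used with conda env create -f myenv.yml. Writing the file by hand works just as well; the function only makes the structure explicit.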

One overall drawback is that conda gets kind of slow when using the large conda-forge channel. They're working on it, but at the same time conda-forge index is growing really fast.

Alex asked Feb 01 '19 07:02



2 Answers

I was wondering if there is an approach to have some packages, e.g. the ones you use in most projects, installed globally ... Other things would go in local virtualenv-folders

Yes, virtualenv supports this. Install the globally-needed packages globally, and then, whenever you create a virtualenv, supply the --system-site-packages option so that the resulting virtualenv will still be able to use globally-installed packages. When using tox, you can set this option in the created virtualenvs by including sitepackages=true in the appropriate [testenv] section(s).
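The same idea can be tried from Python itself. The sketch below uses the standard library's venv module rather than virtualenv (an assumption made here so the example has no third-party dependency); its system_site_packages argument plays the role of --system-site-packages. The helper names are hypothetical, and with_pip=False is only there to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env_with_global_packages(env_dir):
    """Create a virtual environment that can also import globally installed packages."""
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"


def read_cfg(cfg_path):
    """Parse the simple 'key = value' lines of pyvenv.cfg into a dict."""
    pairs = (
        line.split("=", 1)
        for line in cfg_path.read_text().splitlines()
        if "=" in line
    )
    return {key.strip(): value.strip() for key, value in pairs}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = read_cfg(create_env_with_global_packages(Path(tmp) / "env"))
        # Shows whether the new environment falls back to global site-packages.
        print(cfg["include-system-site-packages"])
```

Packages installed inside such an environment still take precedence over the global ones, so a project can pin its own version of a package when it needs to.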

jwodder answered Sep 30 '22 08:09


Problem

You have listed a number of issues that no one approach may be able to completely resolve:

  • space

'I need the "big" packages: numpy, pandas, scipy, matplotlib... Virtually I have about 100+ GB of my HDD filled with python virtual dependencies'

  • time

... installing all of these in each virtual environment takes time

  • publishing

... none of these package managers really help with publishing & testing code ...

  • workflow

I am tempted to move my current workflow from pipenv to conda.

Thankfully, what you have described is not quite the classic dependency problem that plagues package managers - circular dependencies, pinning dependencies, versioning, etc.


Details

I have used conda on Windows for many years now under similar restrictions with reasonable success. Conda was originally designed to make installing scipy-related packages easier. It still does.

If you are using the "scipy stack" (scipy, numpy, pandas, ...), conda is your most reliable choice.

Conda can:

  • install scipy packages
  • install C-extensions and non-Python packages (needed to run numpy and other packages)
  • integrate conda packages, conda channels (you should look into this) and pip to access packages
  • separate dependencies with virtual environments

Conda can't:

  • help with publishing code

Reproducible Envs

The following steps should help reproduce virtualenvs if needed:

  • Do not install scipy packages with pip. I would rely on conda to do the heavy lifting. It is much faster and more stable. You can pip install less common packages inside conda environments.
  • On occasion, a pip package may conflict with conda packages within an environment (see release notes addressing this issue).

Avoid pip-issues

I was wondering if there is an approach to have some packages, e.g. the ones you use in most projects, installed globally ... Other things would go in local virtualenv-folders

Non-conda tools

  • pipx is a pip-like tool that creates global virtual environments.
  • virtualenv traditionally makes virtual environments per project, but thankfully @jwodder's answer explains how to use global packages.
  • virtualenvwrapper facilitates global virtualenvs.

conda

However, if you want to stay with conda, you can try the following:

A. Make a working environment separate from your base environment, e.g. workenv. Consider this your go-to, "global" env to do the bulk of your daily work.

> conda create -n workenv python=3.7 numpy pandas matplotlib scipy
> activate workenv
(workenv)>

B. Test installations of uncommon pip packages (or weighty conda packages) within a clone of the working env

> conda create --name testenv --clone workenv
> activate testenv
(testenv)> pip install pint

Alternatively, make new environments with minimal packages using a requirements.txt file.

C. Make a backup of dependencies into a requirements.txt-like file called environment.yml per virtualenv, e.g. with conda env export > environment.yml from within the activated environment. Optionally make a script to run this command per environment. See docs on sharing/creating environment files. Create environments in the future from this file:

> conda env create --name testenv --file environment.yml
> activate testenv
(testenv)> conda list
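The optional per-environment backup script from step C could look roughly like this. It is a sketch under assumptions: conda is on PATH, the backup folder name env_backups and all function names are made up for this example, and each file is named after the last part of the environment's folder (so the base environment ends up named after the install folder).

```python
import json
import subprocess
from pathlib import Path


def export_command(env_prefix, backup_dir, conda="conda"):
    """Build the conda call that writes one environment to <backup_dir>/<env name>.yml."""
    target = Path(backup_dir) / f"{Path(env_prefix).name}.yml"
    return [conda, "env", "export", "--prefix", str(env_prefix), "--file", str(target)]


def list_env_prefixes(conda="conda"):
    """Ask conda for the folders of all known environments."""
    result = subprocess.run(
        [conda, "env", "list", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["envs"]


def backup_all(backup_dir="env_backups", conda="conda"):
    """Export every environment and return the commands that were run."""
    Path(backup_dir).mkdir(exist_ok=True)
    commands = [
        export_command(prefix, backup_dir, conda)
        for prefix in list_env_prefixes(conda)
    ]
    for command in commands:
        subprocess.run(command, check=True)
    return commands
```

Calling backup_all() once a week, for example from a scheduled task, keeps one environment.yml-style file per environment. On Windows it may be necessary to pass the full path of the conda executable via the conda argument.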

Publishing

The packaging problem is an ongoing, separate issue that has gained traction with the advent of the pyproject.toml file via PEP 518 (see related blog post by author B. Cannon). Packaging tools such as flit or poetry have adopted this modern convention to make distributions and publish them to a server or package index (PyPI). The pyproject.toml concept tries to move away from traditional setup.py files and their specific dependence on setuptools.

Dependencies

Tools like pipenv and poetry have a unique modern approach to addressing the dependency problem via a "lock" file. This file allows you to track and reproduce the state of your dependency graphs, something novel in the Python packaging world so far (see more on Pipfile vs. setup.py here). Moreover, there are claims that you can still use these tools in conjunction with conda, although I have not tested the extent of these claims. The lock file isn't standardized yet, but according to core developer B. Cannon in an interview on The future of Python packaging (~33m), "I'd like to get us there." (See Updates).
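To make the idea behind a lock file concrete, the sketch below records the exact version of every distribution installed in the current environment, similar in spirit to pip freeze. It is not the Pipfile.lock or poetry.lock format, only an illustration of pinning; the function name is hypothetical and it assumes Python 3.8+ for importlib.metadata.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

Saving this output next to the project records the state an environment was in, which is the core of what the lock-file tools automate, together with resolving the dependency graph.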

Summary

If you are working with any package from the scipy stack, use conda (Recommended):

  • To address space, time and workflow issues, use conda or miniconda.
  • To resolve deploying applications or using a "lock" file on your dependencies, consider the following in conjunction with conda:
    • pipenv: use to deploy and make Pipfile.lock
    • poetry: use to deploy and make poetry.lock
  • To publish a library on PyPI, consider:
    • pipenv: develop via pipenv install -e . and manually publish with twine
    • flit: automatically package and publish
    • poetry: automatically package and publish

See Also

  • conda docs on managing environment files.
  • Podcast interview with B. Cannon discussing the general packaging problem, pyproject.toml, lock files and tools.
  • Podcast interview with K. Reitz discussing packaging tools (pipenv vs. pip, 37m) and dev environment.

Updates:

  • A new dependency resolver is shipped with pip 21.0.
  • PEP 665 proposes a standardized lock-file (c. 2021)

pylang answered Sep 30 '22 09:09