
Is conda install a thread-safe operation?

Tags: python, conda

I would like to install packages into multiple conda environments. Doing this one after the other takes quite some time, so it would be nice if I could run all the conda install steps for each environment in parallel. Would this be possible or are there conflicts (relating to hard links and lock files, possibly) when trying to run conda in parallel?

asked Oct 02 '19 22:10 by RedbackThomson






1 Answer

The short answer: No, it should not be run concurrently.

Most of how Conda handles transaction safety was established in version 4.3. The v4.3.0 release notes on changes to locking explicitly comment on running multiple processes:

[U]sers are cautioned that undefined behavior can result when conda is running in multiple process and operating on the same package caches and/or environments.

It sounds like you're talking about different environments, so that shouldn't be an issue in itself. However, the environments still share the package cache, so you need to ensure that the packages to be installed are already downloaded into it; otherwise it is not safe.

Partial Parallel Strategy

There is a --download-only flag, which will only add the package to the package cache (i.e., the part that cannot be done concurrently). But the issue is that this would still need to be done on a per-env basis, since different envs could have different constraints (e.g., different Python versions) that require different builds of the package.

I think the best you could do at the CLI is

  1. Run conda install --download-only pkg sequentially on each env, then
  2. Run conda install pkg in parallel for the envs.
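The two steps above can be sketched as a small shell function. This is only an illustration of the idea, not an officially supported workflow: install_in_envs and the CONDA_CMD override are hypothetical names introduced here (the override exists so the function can be exercised with a stub instead of a real conda).

```shell
#!/bin/sh
# Sketch of the two-phase strategy. CONDA_CMD is a hypothetical override,
# defaulting to the real conda executable.
CONDA_CMD="${CONDA_CMD:-conda}"

# install_in_envs PKG ENV [ENV...]
install_in_envs() {
    pkg="$1"; shift
    pids=""
    status=0

    # Step 1: fill the shared package cache sequentially, one env at a time.
    for env in "$@"; do
        "$CONDA_CMD" install -n "$env" -y --download-only "$pkg" || return 1
    done

    # Step 2: run the actual installs as parallel background jobs.
    for env in "$@"; do
        "$CONDA_CMD" install -n "$env" -y "$pkg" &
        pids="$pids $!"
    done

    # Wait for every job; report failure if any single install failed.
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example (assumes envs foo and bar already exist):
#   install_in_envs pandas foo bar
```

Waiting on each PID individually means a failed install in one env is not silently lost, which a bare trailing & would not give you.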

This is, however, not in any official recommendation, and changes in how Conda does transactions could lead to this not being safe. I'll also say that I very much doubt this will save you much time; in fact, it might take longer. This approach will involve every env having to solve and prepare transactions twice, and that is usually the most computationally intensive step. The part you end up parallelizing involves disk transactions, which is going to be I/O bound, so I kind of doubt any time will be saved.
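Whether any time is actually saved is easy to measure rather than guess. A minimal sketch, assuming a date that supports +%s (GNU and BSD date both do); elapsed_seconds is a hypothetical helper name, and the resolution is whole seconds only.

```shell
#!/bin/sh
# elapsed_seconds CMD [ARGS...]
# Runs the command, discards its output, and prints wall-clock seconds taken.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example comparison (assumes envs foo and bar exist and the cache is warm):
#   elapsed_seconds sh -c 'conda install -n foo -y pandas; conda install -n bar -y pandas'
#   elapsed_seconds sh -c 'conda install -n foo -y pandas & conda install -n bar -y pandas & wait'
```

For a fair comparison, remove the package from both envs between runs so that each run performs the same LINK work.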

Some Evidence For This Being Safe

While this doesn't positively prove its safety, we can explicitly examine the transactions to make sure that when we run Step 2 above, it will only involve LINK transactions.

To test this, I made two envs:

conda create -n foo -y python=3.6
conda create -n bar -y python=3.6

Then I check the dry-run (-d) output, as JSON, from

conda install -n foo -d --json pandas

which shows a list of both FETCH and LINK transactions. The former involve manipulating the package cache, whereas the latter only the env. If I then run

conda install -n foo --download-only pandas

and check again,

conda install -n foo -d --json pandas

I now see only LINK transactions. Notably, the same is now true for -n bar, because both envs share the package cache, which reinforces that Step 1 should be done sequentially. The good part is that it won't lead to redownloading the same package; the bad part is that it involves a solve in every env. In a more heterogeneous setup, we could expect different FETCH operations in each env.
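The before/after check can be automated instead of read by eye. A rough sketch, assuming that a plan with nothing to download simply has no "FETCH" key in the dry-run JSON, as the output described here suggests (the exact JSON layout may differ between conda versions). plan_needs_fetch, envs_needing_fetch and CONDA_CMD are hypothetical names.

```shell
#!/bin/sh
# CONDA_CMD is a hypothetical override, defaulting to the real conda.
CONDA_CMD="${CONDA_CMD:-conda}"

# Reads dry-run JSON on stdin; succeeds if it mentions a FETCH action.
# Assumption: the "FETCH" key is absent when everything is already cached.
plan_needs_fetch() {
    grep -q '"FETCH"'
}

# envs_needing_fetch PKG ENV [ENV...]
# Prints the name of each env whose plan would still touch the package cache.
envs_needing_fetch() {
    pkg="$1"; shift
    for env in "$@"; do
        if "$CONDA_CMD" install -n "$env" --dry-run --json "$pkg" | plan_needs_fetch; then
            echo "$env"
        fi
    done
}

# Example: an empty result means only LINK transactions remain.
#   envs_needing_fetch pandas foo bar
```

The check relies on grep's exit status only, so it does not depend on what exit code conda itself returns for a dry run.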

Finally, I can run the parallel final install

conda install -n foo -y pandas & conda install -n bar -y pandas &

which is safe if we can assume that LINK transactions in different envs are safe.

answered Oct 10 '22 09:10 by merv