
How to speed up an already cached pip install?

I frequently have to re-create virtual environments from a requirements.txt and I am already using $PIP_DOWNLOAD_CACHE. It still takes a lot of time and I noticed the following:

Pip spends a lot of time between the following two lines:

Downloading/unpacking SomePackage==1.4 (from -r requirements.txt (line 2))
  Using download cache from $HOME/.pip_download_cache/cached_package.tar.gz

It takes about 20 seconds on average for pip to decide to use the cached package; the install itself is then fast. That adds up when you have to install dozens of packages (enough to make me write this question).

What is going on in the background? Is pip running some sort of integrity check against the online package?

Is there a way to speed this up?

edit: Looking at:

time pip install -v Django==1.4

I get:

real    1m16.120s
user    0m4.312s
sys     0m1.280s

The full output is here: http://pastebin.com/e4Q2B5BA. It looks like pip spends its time looking for a valid download link even though it already has a valid cached copy of http://pypi.python.org/packages/source/D/Django/Django-1.4.tar.gz.

Is there a way to make pip check the cache first and stop there if the versions match?

Maxime R. asked Sep 13 '12 15:09


2 Answers

After spending some time studying the pip internals and profiling some package installations, I came to the conclusion that even with a download cache, pip does the following for each package:

  • goes to the main index url, usually http://pypi.python.org/simple// (example)
  • follows every link to fetch additional web pages
  • extracts all links from all those pages
  • checks the validity of all the links against the package name and version requirements
  • selects the most recent version from the valid links

Only now that pip has a download url does it check the download cache folder (if one is configured), and it finally skips the download if a local file named after the url is present.
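The steps above can be sketched roughly as follows. This is only an illustration, not pip's actual code: the class and function names are made up, the index page is a hardcoded stand-in so nothing is fetched over the network, version parsing is naive, and the cache file naming (the url with every reserved character percent-quoted) is an assumption about how "named after the url" works.

```python
import re
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def version_key(version):
    """Turn '1.4.1' into (1, 4, 1) so versions compare numerically (naive)."""
    return tuple(int(part) for part in version.split("."))


def candidate_versions(links, name):
    """Map version -> url for links that look like source archives of `name`."""
    pattern = re.compile(r"/%s-(\d+(?:\.\d+)*)\.tar\.gz$" % re.escape(name))
    found = {}
    for link in links:
        url = link.split("#", 1)[0]  # drop the '#md5=...' fragment
        match = pattern.search(url)
        if match:
            found[match.group(1)] = url
    return found


def select_link(links, name, pinned=None):
    """Return the url for the pinned version, else for the most recent one."""
    candidates = candidate_versions(links, name)
    if pinned is not None:
        return candidates.get(pinned)
    if not candidates:
        return None
    return candidates[max(candidates, key=version_key)]


def cache_file_name(url):
    """Assumed naming scheme: the url with all reserved characters quoted."""
    return quote(url, safe="")


# Hardcoded stand-in for an index page; the hashes are placeholders.
PAGE = """
<a href="../../packages/source/D/Django/Django-1.3.tar.gz#md5=abc">Django-1.3.tar.gz</a>
<a href="../../packages/source/D/Django/Django-1.4.tar.gz#md5=def">Django-1.4.tar.gz</a>
<a href="http://www.djangoproject.com/">Home Page</a>
"""

collector = LinkCollector("http://pypi.python.org/simple/Django/")
collector.feed(PAGE)
url = select_link(collector.links, "Django", pinned="1.4")
print(url)
print(cache_file_name(url))
```

The point of the sketch is the ordering: the cache file name can only be computed once a url has been selected, which is why the crawl happens even when the archive is already on disk.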

My guess is that we could save a lot of time by checking the cache up front, but I do not understand the pip code base well enough to make the required modifications. Of course, this would only work for exact version requirements (==), because with other constraints, such as >= or >, we still want to crawl the web looking for the latest version.
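A minimal sketch of what such an up-front check might look like, under stated assumptions: cache entries are named after the percent-quoted download url, archives follow the Name-version.tar.gz pattern, and the package name matches case-sensitively. The helpers parse_exact_pin and find_in_cache are hypothetical and do not exist in pip.

```python
import os
import re
import tempfile
from urllib.parse import unquote

# Matches only exact pins such as 'Django==1.4'.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([A-Za-z0-9_.]+)\s*$")


def parse_exact_pin(requirement):
    """Return (name, version) for 'Name==1.4', or None for other specifiers."""
    match = PIN.match(requirement)
    return match.groups() if match else None


def find_in_cache(cache_dir, requirement):
    """Return the path of a cached archive for an exact pin, or None."""
    pin = parse_exact_pin(requirement)
    if pin is None:
        return None  # '>=', '>' and similar still need the index crawl
    name, version = pin
    wanted = "/%s-%s.tar.gz" % (name, version)
    for entry in sorted(os.listdir(cache_dir)):
        # Entries are assumed to be quoted urls, so unquote before comparing.
        if unquote(entry).endswith(wanted):
            return os.path.join(cache_dir, entry)
    return None


# Demo against a throwaway directory standing in for the download cache.
with tempfile.TemporaryDirectory() as cache_dir:
    entry = ("http%3A%2F%2Fpypi.python.org%2Fpackages%2Fsource"
             "%2FD%2FDjango%2FDjango-1.4.tar.gz")
    open(os.path.join(cache_dir, entry), "wb").close()
    hit = find_in_cache(cache_dir, "Django==1.4")
    miss = find_in_cache(cache_dir, "Django>=1.4")
    other = find_in_cache(cache_dir, "Django==1.3")

print(hit is not None, miss, other)
```

A real implementation would also have to decide whether a cached file can be trusted without the hash published on the index page, which this sketch ignores.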

Nevertheless, I was able to make a small pull request which will save us some time if merged.

Maxime R. answered Oct 05 '22 13:10


One alternative may be to avoid rebuilding the virtualenv and instead keep a master virtual environment that you can update and copy as required.

virtualenvwrapper provides some support for this with the cpvirtualenv command.

Andrew Walker answered Oct 05 '22 12:10