 

Why is Heroku installing old Python (pip) dependencies in new deployments?

(Asking here as directed by Heroku's own support)

We have just uncovered a dependency issue in a project with mismatched libraries between dev environments. The details aren't relevant, but the underlying cause was a dependency that had a ">=" version specifier in its setup.py, which meant that when a dev rebuilt his environment he suddenly got the latest version (0.4.0) instead of the version he'd had previously (0.3.11), and started getting a DeprecationWarning.
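To illustrate the pattern (the package names here are made up, purely for illustration): a floating specifier in the dependency's setup.py looked something like the first entry below, and an upper-bounded or pinned spec is what would have prevented the surprise upgrade:

from setuptools import setup

setup(
    name='some-internal-package',      # hypothetical package, illustration only
    install_requires=[
        'somelib>=0.3',                # floats: a rebuild today resolves to 0.4.0
        # 'somelib>=0.3,<0.4',         # an upper-bounded spec would have avoided the surprise
    ],
)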

As part of the debugging process, I was under the impression that whenever a repo is pushed to Heroku a clean environment is rebuilt, which led me to assume (incorrectly) that our dev environment, which is rebuilt daily, would have had the latest version installed. Because we weren't seeing the issue on the dev environment, I decided to investigate and ran heroku run pip list on the remote environment.

I was (very) surprised to see that the output was a lucky dip of old and expired dependencies, not a clean environment at all. It turns out that the issue we were debugging may well have been living happily on our live environment as part of an old install.

The easiest way to explain this is with the BeautifulSoup library. We recently updated from v3 to v4, and as part of this the library itself changed name on PyPI from BeautifulSoup to beautifulsoup4. We updated our requirements.txt to reflect this, but if I now run pip list on our Heroku environment I get both:

~ $ heroku run bash
~ $ pip list
BeautifulSoup (3.2.1)
beautifulsoup4 (4.3.2)
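(For context, the corresponding requirements.txt change was simply swapping the package name - an illustrative excerpt, with the rest of the file omitted:)

# requirements.txt - before
BeautifulSoup==3.2.1
# requirements.txt - after
beautifulsoup4==4.3.2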

So, the old dependency hasn't been cleared, it's just sitting there. I can prove it easily enough by firing up a python session:

~ $ python
Python 2.7.4 (default, Apr  6 2013, 22:14:13)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import BeautifulSoup
>>>

This came as a bit of a shock, and I'm amazed it hasn't killed off our application at some point.

So, the question is: how does Heroku manage dependencies under the hood? It obviously doesn't wipe the Python environment and re-run pip install on each deployment. And is there any way I can force that behaviour?

[EDIT 1]

FWIW, this is the buildpack that does the install - https://github.com/heroku/heroku-buildpack-python/blob/master/bin/compile

[EDIT 2]

From the Heroku Buildpack docs:

The contents of CACHE_DIR will be persisted between builds. You can cache the results of long processes like dependency resolution here to speed up future builds.

And further down:

The bin/compile script will be given a CACHE_DIR as its second argument which can be used to store artifacts between builds. Artifacts stored in this directory will be available in the CACHE_DIR during successive builds. CACHE_DIR is available only during slug compilation, and is specific to the app being built.

The recommendation is:

Heroku users can use the heroku-repo plugin to clear the build cache created by the buildpack they use for their app

That said, whilst I understand that the cache is used to speed up future compilations, I don't understand why everything in the cache ends up installed in the new environment. That doesn't make any sense to me.

Asked Oct 21 '22 by Hugo Rodger-Brown


1 Answer

Heroku installs your Python environment in /app/.heroku/python and the entire .heroku directory gets copied in from the CACHE_DIR at the beginning of each build and then back to the CACHE_DIR at the end. (If you search for restore_cache and dump_cache in that buildpack script you'll see the lines that do this.)
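A stripped-down sketch of what that restore/dump pattern amounts to (this is not the actual buildpack code - paths and steps are simplified for illustration):

# bin/compile is invoked as: bin/compile <BUILD_DIR> <CACHE_DIR>
BUILD_DIR=$1
CACHE_DIR=$2

# "restore_cache": copy the previously built environment back into the new build
if [ -d "$CACHE_DIR/.heroku" ]; then
    cp -R "$CACHE_DIR/.heroku" "$BUILD_DIR/"
fi

# pip install -r requirements.txt then runs against that pre-existing environment,
# so anything already installed (e.g. BeautifulSoup 3.2.1) is left in place

# "dump_cache": copy the resulting environment back out for the next build
rm -rf "$CACHE_DIR/.heroku"
mkdir -p "$CACHE_DIR"
cp -R "$BUILD_DIR/.heroku" "$CACHE_DIR/"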

So once you've installed something in your Heroku app, there it stays unless the CACHE_DIR somehow gets wiped. Obviously this isn't ideal, but it was done to avoid the very long build times that would result from recompiling and installing all your dependencies on every deploy. (For what it's worth, I think there's now a better way of achieving this using wheels to cache compiled packages so you can get a fresh environment every time without having to wait for everything to rebuild. I might try to put together an example and poke Kenneth Reitz about it.)
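(By way of illustration, the wheel-based approach would look roughly like this - assuming a wheelhouse/ directory is what gets cached between builds instead of the installed environment:)

~ $ pip wheel -r requirements.txt -w wheelhouse/
~ $ pip install --no-index --find-links=wheelhouse/ -r requirements.txt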

In the immediate term, it sounds like you would benefit from explicitly pinning the versions of all your dependencies (which include dependencies of dependencies and so on). Simply running pip freeze > requirements.txt would be a good start, although you might also want to look at pip-tools. You really want to avoid the situation you just described where a dev rebuilds their environment from requirements.txt and ends up with a different version of a package.
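(For example - pip-tools' pip-compile reads a loosely specified requirements.in and writes a fully pinned requirements.txt, transitive dependencies included:)

~ $ pip freeze > requirements.txt      # pin exactly what is installed right now
~ $ pip install pip-tools
~ $ pip-compile requirements.in        # writes a fully pinned requirements.txt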

As for BeautifulSoup changing their package name (but not, presumably, the name of the module you import), that sounds really horrible, and exactly the sort of situation in which Heroku's caching breaks down! I think the only solution would be to use that heroku-repo plugin and nuke your cache entirely.
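(For anyone following along, that looks roughly like this - the app name is a placeholder, and a subsequent deploy is needed for the clean rebuild to actually happen:)

~ $ heroku plugins:install heroku-repo
~ $ heroku repo:purge_cache -a your-app-name
~ $ git commit --allow-empty -m "Rebuild after purging build cache"
~ $ git push heroku master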

Answered Oct 23 '22 by D. Evans