pip install : Lightspeed and Bulletproof

I saw a post about speeding up the Python packaging command “pip install” by specifying more responsive mirrors for querying and downloading packages. For my situation, a better tactic is this.

Step one: Download all your project’s dependencies into a local ‘packages’ dir, but don’t install them yet:

mkdir packages
pip install --download=packages -r requirements.txt
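For anyone reading this on newer tooling: later versions of pip replaced the `--download` flag with a dedicated subcommand, so the equivalent of step one on a modern pip is:

```shell
# Equivalent of step one on newer pip versions, where the
# --download flag has been replaced by the 'download' subcommand:
pip download -d packages -r requirements.txt
```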

Step two: install from the ‘packages’ dir:

pip install --no-index --find-links=packages -r requirements.txt

(The above syntax works on pip 1.3, released yesterday. Docs for older versions of pip claim to support this, but in practice, for pip 1.2, I’ve had to use “--find-links=file://$PWD/packages”.)

Step two works even if PyPI is unreachable. It works even if some of your dependencies are self-hosted by their authors and that website is unreachable. It works even if the pinned version of one of your dependencies has been deleted by its author (some packages do this routinely after security updates). It works even if you have no network connection at all. In short, it makes the creation of your virtualenv bulletproof.

As a nice side effect, it runs really fast, because it isn’t downloading the packages across the internet, nor is it attempting to scan a remote index to check for matching or newer versions of each package. This is much quicker than just using a Pip download cache, especially for large projects with many dependencies which only change occasionally.

At Rangespan, we check the ‘packages’ directory into source control, so that once you’ve checked out a project’s repo, you have everything you need to deploy locally and run, even if you have no network. You might choose to treat ‘packages’ as ephemeral.
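A minimal sketch of that check-in step, assuming git (the directory name and commit message are just examples):

```shell
# Commit the downloaded sdists alongside the code, so a fresh clone
# can install its dependencies with no network at all.
git add packages/
git commit -m "Vendor pinned sdists for offline installs"
```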

It was pointed out to me recently by @jezdez, a pip maintainer, that this usage pattern has now been explicitly called out in the documentation, which was substantially reorganised and improved with the recent 1.3 release.

16 thoughts on “pip install : Lightspeed and Bulletproof”

  1. We wrote a tool called Terrarium to do basically this same thing: https://github.com/PolicyStat/terrarium

    It creates a relocatable virtualenv for you and stores it on S3 or a shared file system. Then subsequent installs with the same requirements will just pull down the existing virtualenv, making it just a file copy operation (plus some path cleanup in some cases).

    • Andy: the idea is that you commit the contents of the “packages” directory to version control along with your own source code. When you check the project out on a new laptop, or onto a new production machine, you re-run pip in that new environment to install all the dependencies. The “packages” directory means that, when that moment comes, you know the install will succeed without needing the network. Many an unhappy developer, with a client or boss panicking that the service is down and a new production machine needing to be spun up, has discovered that PyPI is down, or that some dependency had all along been installing from some *other* website that is down at that moment.

    • A lot of Python developers use virtualenv, which is a way to install Python libraries in separate, isolated environments. It’s great when you need a specific version of a library, want to test a new library, want to install a library without root access, or don’t want to taint your whole system with it. But then you may need to install the same library multiple times, which is why it’s nice to have an offline cache.

  2. You can also run pip like so:

    PIP_DOWNLOAD_CACHE=$HOME/.pip_cache pip install foo

    This creates the cache, places downloaded packages into it, and reuses them rather than downloading them again.

    • The download cache is quite a bit more fragile than the technique demonstrated in the article; really it doesn’t add any robustness against network failure, it just speeds things up (very slightly) if some of the packages are large. The reason is that the download cache only stores sdist tarballs indexed by URL. When using a download cache pip still has to use its normal PyPI+external-hosts scraping to figure out which sdists at which URLs to download in order to fulfill your requirements; only then can it check whether it has the sdists at those URLs cached already. So you still have most of the network dependency and latency. With the technique in the article you avoid _all_ network access.

  3. Wow, great to see that pip 1.3 properly handles recursive dependencies with --no-install --download! This is so much better than before! Good work guys!

  4. Yeah, I have a lot of “-f file:///” going on with pip. It’s interesting that 1.3 no longer needs that; with “file:///” I also had to use absolute paths, as the path arithmetic within pip wasn’t consistent either. Here’s a GitHub issue import of the ticket where I was complaining about this use case: https://github.com/pypa/pip/issues/111 (still open for some reason; I think it’s solved…). If they’ve finally fixed the absolute-path requirement then things are just peachy.

  5. Pingback: A Smattering of Selenium #147 | Official Selenium Blog

  6. I’m not sure what benefit this gives you over simply installing them locally using:

    pip install -r requirements.txt --target=packages

    This will download and install them into a directory.

    • Hey Daniel. Everything I cite in the post as an advantage of the method I describe (e.g. “works even if PyPI is unreachable”) is something that is wrong with the command you quote. That command will fail if PyPI is down or unreachable, and hence shouldn’t be used in your production deployments; otherwise you’ll be unable to deploy new releases or urgent fixes just because PyPI is down, or because some other package author (accidentally?) deleted their package, etc. Plus it’s really slow, which is a problem if you routinely re-create your virtualenv, for example when running tests (which you should, for system / installer / deployment tests).
