I have a framework composed of different tools, written in Python, running in a multi-user environment.
The first time I log in to the system and start one command, it takes 6 seconds just to show a few lines of help. If I immediately issue the same command again, it takes 0.1 s. After a couple of minutes it goes back to 6 s (evidence of a short-lived cache).
The system sits on GPFS, so disk throughput should be OK, though access might be slow because of the sheer number of files in the system.
strace -e open python tool 2>&1 | wc -l
shows 2154 files being accessed when starting the tool.
strace -e open python tool 2>&1 | grep ENOENT | wc -l
shows 1945 missing files being looked for (a very bad hit/miss ratio, if you ask me :-)).
My hunch is that the excessive startup time is spent querying GPFS about all those files, and that the results are cached for the next call (at either the system or the GPFS level), though I don't know how to test or prove it. I have no root access to the system, and I can only write to GPFS and /tmp.
Is it possible to cut down on this Python quest for missing files?
Any idea on how to test this in a simple way? (Reinstalling everything in /tmp is not simple, as there are many packages involved; virtualenv will not help either, I think, since it just symlinks to the files on the GPFS system.)
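The simplest check I can think of is timing the heavy imports one by one, cold versus warm; here is a rough sketch (numpy and scipy are just placeholders for whatever the tool actually imports):
# time_imports.py -- time each heavy import separately; run it right
# after logging in (cold cache) and again immediately afterwards (warm
# cache).  If the cold run accounts for most of the 6 seconds, the time
# really is going into module loading and the file lookups strace shows.
import time

for name in ("numpy", "scipy"):  # placeholders: substitute the tool's real imports
    start = time.time()
    __import__(name)
    print "%-10s %.3f s" % (name, time.time() - start)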
Of course, one option would be a daemon that forks, but that's far from "simple" and would be a last-resort solution.
Thanks for reading.
How about using the imp module? In particular, there is the function imp.find_module(name, path): http://docs.python.org/2.7/library/imp.html
At least this example (see below) reduces the number of open() syscalls compared to a plain 'import numpy, scipy' (update: but it doesn't look like it is possible to achieve significant reductions in syscalls this way...):
import imp
import sys

def loadm(name, path):
    fp, pathname, description = imp.find_module(name, [path])
    try:
        _module = imp.load_module(name, fp, pathname, description)
        return _module
    finally:
        # Since we may exit via an exception, close fp explicitly.
        if fp:
            fp.close()

numpy = loadm("numpy", "/home/username/py-virtual27/lib/python2.7/site-packages/")
scipy = loadm("scipy", "/home/username/py-virtual27/lib/python2.7/site-packages/")
I guess you'd also better check that your PYTHONPATH is empty or short, because every extra entry there adds more directories to search on each import and so increases the loading time.
Python 2 looks for modules relative to the current package first. If your library code has a lot of imports of a lot of top-level modules, those are all looked up as relative imports first. So, if code in package foo.bar does import os, Python first looks for foo/bar/os.py. That miss is cached by Python itself too.
In Python 3, the default has moved to absolute imports instead; you can switch Python 2.5 and up to use absolute imports per module with:
from __future__ import absolute_import
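As a sketch, here is what that looks like for a module inside the hypothetical foo.bar package from the example above; with the __future__ line in place, the relative probing is skipped:
# foo/bar/__init__.py -- hypothetical package module.
from __future__ import absolute_import

# With absolute_import in effect, these imports go straight to the
# standard library; without it, Python 2 first probes for foo/bar/os.py
# and foo/bar/sys.py, producing two of the ENOENT misses seen in strace.
import os
import sys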
Another source of file-lookup misses is loading .pyc bytecode cache files; if those are missing for some reason (e.g. the filesystem is not writable by the current Python process), then Python will keep looking for them on every run. You can create these caches with the compileall module:
python -m compileall /path/to/directory/with/pythoncode
provided you run that with the correct write permissions.
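If you first want to verify that missing .pyc files are part of the problem, a quick walk over the install directory will tell you. A rough sketch (the site-packages path is the one from the earlier example; adjust it to your setup):
# check_pyc.py -- list .py files that have no .pyc next to them; each one
# makes Python look for (and, if the directory is not writable, fail to
# create) the bytecode cache on every run.
import os

root = "/home/username/py-virtual27/lib/python2.7/site-packages/"  # adjust
missing = []
for dirpath, dirnames, filenames in os.walk(root):
    names = set(filenames)
    for fn in filenames:
        if fn.endswith(".py") and fn + "c" not in names:
            missing.append(os.path.join(dirpath, fn))

print "%d modules without a .pyc" % len(missing)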