Can Python be configured to cache sys.path directory lookups?

Tags: python, python-import

We've been doing a lot of benchmarking of Python running over a remote connection. The program is running offsite but accessing disks on-site. We are running under RHEL6. We watched a simple program with strace. It appears it's spending a lot of time performing stat and open calls on files to see if they are there. Over a remote connection that is costly. Is there a way to configure Python to read a directory's contents once and cache its listing so it doesn't have to check it again?

Sample Program test_import.py:

import random
import itertools

I ran the following commands:

$ strace -Tf python test_import.py >& strace.out
$ grep '/usr/lib64/python2.6/' strace.out | wc
331    3160   35350

So it's looking in that directory roughly 331 times, and many of those calls fail with results like:

stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>

If it cached the directory it wouldn't have to stat the file to see if it's there.

asked Aug 12 '14 02:08

Paul Hildebrandt

2 Answers

You can avoid this by either moving to Python 3.3, or replacing the standard import system with an alternative. In the strace talk that I gave two weeks ago at PyOhio, I discuss the unfortunate O(nm) performance (for n directories and m possible suffixes) of the old import mechanism; start at this slide.

I demonstrate how easy_install plus a Zope-powered web framework generates 73,477 system calls simply to do enough imports to get up and running.
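To make the O(nm) behavior concrete, here is a minimal sketch that enumerates the paths an uncached importer might probe for one module name. The suffix list is an assumption for illustration (the real list depends on the Python version and platform, so compare it against your own strace output), and `candidate_paths` is a hypothetical helper, not part of Python.

```python
import os

# Assumed suffix list for illustration only; the bare name stands for the
# package-directory check, the rest for extension and source/bytecode files.
SUFFIXES = ["", ".so", "module.so", ".py", ".pyc"]


def candidate_paths(module_name, search_dirs, suffixes=SUFFIXES):
    """Return every path an uncached importer might probe for module_name."""
    return [os.path.join(directory, module_name + suffix)
            for directory in search_dirs
            for suffix in suffixes]


dirs = ["/usr/lib64/python2.6", "/usr/lib64/python2.6/site-packages"]
probes = candidate_paths("posixpath", dirs)
print(len(probes))  # n directories * m suffixes = 2 * 5 = 10
```

Every probe that misses is a failed stat or open call, so a long sys.path multiplies the cost of each import.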

After a quick install of bottle in a virtualenv on my laptop, for example, I find that 1,994 calls are necessary for Python to import that module and be up and running:

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179           0      1519      1355 open
  0.00    0.000000           0       475       363 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179                  1994      1718 total

If I hop into os.py, however, I can add a caching importer and even with a very naive implementation can cut the number of misses down by nearly a thousand:

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041           0       699       581 open
  0.00    0.000000           0       301       189 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041                  1000       770 total

I chose os.py for the experiment because strace shows it to be the very first module that Python imports, and the sooner we can get our importer installed, the fewer Standard Library modules Python will have to import under its old terrible slow regime!

# Put this right below "del _names" in os.py
# (Python 2 syntax; it relies on the names sys, listdir and path
# that os.py has already bound by that point.)

class CachingImporter(object):

    def __init__(self):
        # Maps each sys.path entry to its cached directory listing.
        self.directory_listings = {}

    def find_module(self, fullname, other_path=None):
        filename = fullname + '.py'
        for syspath in sys.path:
            listing = self.directory_listings.get(syspath, None)
            if listing is None:
                try:
                    listing = listdir(syspath)
                except OSError:
                    # Missing or unreadable entry: remember it as empty.
                    listing = []
                self.directory_listings[syspath] = listing
            if filename in listing:
                modpath = path.join(syspath, filename)
                return CachingLoader(modpath)
        # Falling through returns None, so the normal import system runs.

class CachingLoader(object):

    def __init__(self, modpath):
        self.modpath = modpath

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        import imp
        mod = imp.new_module(fullname)
        mod.__loader__ = self
        # Register the module before executing it, as loaders are expected to.
        sys.modules[fullname] = mod
        mod.__file__ = self.modpath
        with file(self.modpath) as f:
            code = f.read()
        exec code in mod.__dict__
        return mod

# Meta-path importers are consulted before the built-in import machinery.
sys.meta_path.append(CachingImporter())

This has rough edges, of course: it does not try to detect .pyc files or .so files or any of the other extensions that Python might go looking for. Nor does it know about __init__.py files or about modules inside of packages (which would require running listdir() in sub-directories of the sys.path entries). But it at least illustrates that thousands of extra calls can be eliminated through something like this, and demonstrates a direction you might try out. When it cannot find a module, the normal import mechanism simply kicks in instead.
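As a rough sketch of how the lookup half could be extended to cover some of those gaps, the following caches listings as sets and checks several suffixes plus top-level packages. `DirectoryCache`, its method names, and the suffix tuple are all hypothetical choices for illustration; this only locates files and does not load them.

```python
import os
import tempfile


class DirectoryCache(object):
    """Remember each directory's listing so repeat lookups cost no syscalls."""

    def __init__(self, suffixes=(".py", ".pyc", ".so")):
        self.suffixes = suffixes  # assumed suffix list, adjust as needed
        self.listings = {}

    def listing(self, directory):
        names = self.listings.get(directory)
        if names is None:
            try:
                names = frozenset(os.listdir(directory))
            except OSError:
                # Missing directory, or a plain file: treat as empty.
                names = frozenset()
            self.listings[directory] = names
        return names

    def find(self, name, search_dirs):
        """Return the path of a top-level module or package, else None."""
        for directory in search_dirs:
            names = self.listing(directory)
            if name in names:
                # A same-named entry might be a package directory.
                package_dir = os.path.join(directory, name)
                if "__init__.py" in self.listing(package_dir):
                    return os.path.join(package_dir, "__init__.py")
            for suffix in self.suffixes:
                if name + suffix in names:
                    return os.path.join(directory, name + suffix)
        return None


# Usage with a throwaway directory standing in for a sys.path entry.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "spam.py"), "w").close()
os.mkdir(os.path.join(demo_dir, "eggs"))
open(os.path.join(demo_dir, "eggs", "__init__.py"), "w").close()

cache = DirectoryCache()
spam_path = cache.find("spam", [demo_dir])
eggs_path = cache.find("eggs", [demo_dir])
missing = cache.find("ham", [demo_dir])
```

Note that nothing here ever invalidates the cache, so files created after a directory is first listed will not be seen.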

I wonder if there is a good caching importer already available on PyPI or somewhere? It seems like the sort of thing that would have been written a hundred times already in various shops. I thought that Noah Gift had written one and put it in a blog post or something, but I cannot find a link that confirms that memory of mine.

Edit: as @ncoghlan mentions in the comments, there is an alpha-release backport of the new Python 3.3+ import system to Python 2.7 available on PyPI as importlib2 (http://pypi.python.org/pypi/importlib2). Unfortunately, it looks like the questioner is still stuck on 2.6.
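For anyone who can take the Python 3.3+ route, one practical consequence of its built-in directory caching is that a module created while the interpreter is running may not be noticed until the caches are reset. A minimal sketch using importlib.invalidate_caches(), where the directory and the module name `late_module` are made up for illustration:

```python
import importlib
import os
import sys
import tempfile

# A throwaway directory standing in for a real sys.path entry.
plugin_dir = tempfile.mkdtemp()
sys.path.insert(0, plugin_dir)

# First attempt: the module does not exist yet, so the import fails
# and the finder now holds a cached (empty) view of the directory.
try:
    importlib.import_module("late_module")
except ImportError:
    found_before = False
else:
    found_before = True

# Create the module after the directory has already been scanned.
with open(os.path.join(plugin_dir, "late_module.py"), "w") as f:
    f.write("VALUE = 42\n")

# Ask the finders to drop their cached listings before retrying.
importlib.invalidate_caches()
late_module = importlib.import_module("late_module")
print(late_module.VALUE)  # 42
```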

answered Sep 21 '22 12:09

Brandon Rhodes

I know that this is not exactly what you are looking for but I'll answer anyway :D

There is no cache system for sys.path directories but zipimport creates an index of the modules inside of the .zip file. This index is used to make module lookup faster.

The drawback of this solution is that you cannot use it with binary modules (e.g. .so) due to the lack of support in dlopen(), which Python uses to load this kind of module.

Another problem is that some modules (like the posixpath used in your example) are loaded by the CPython interpreter during its bootstrap process.
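As a rough illustration of the zip approach, this sketch bundles one pure-Python module into an archive and puts the archive on sys.path. The names `bundle.zip` and `hello_zip` are hypothetical, and the explicit close() is used so the same code also runs on Python 2.6.

```python
import os
import sys
import tempfile
import zipfile

# Build a small archive containing one pure-Python module.
archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
zf = zipfile.ZipFile(archive, "w")
try:
    zf.writestr("hello_zip.py", "GREETING = 'imported from a zip'\n")
finally:
    zf.close()

# With the archive on sys.path, zipimport serves modules from its index
# instead of probing the filesystem for each candidate file name.
sys.path.insert(0, archive)

import hello_zip

print(hello_zip.GREETING)  # imported from a zip
print(hello_zip.__file__)  # a path inside bundle.zip
```

In practice you would zip up your own pure-Python packages once and ship the archive, leaving extension modules on the regular filesystem.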

PS. I hope you remember me at PythonBrasil when I helped you stuff some bags with Disney/Pixar souvenirs :D

answered Sep 19 '22 12:09

osantana
