
Can Python be configured to cache sys.path directory lookups?

We've been doing a lot of benchmarking of Python running over a remote connection. The program runs offsite but accesses disks on-site. We are running under RHEL6. We watched a simple program with strace, and it appears to spend a lot of time performing stat and open calls on files to see whether they exist. Over a remote connection that is costly. Is there a way to configure Python to read a directory's contents once and cache the listing so it doesn't have to check again?

Sample Program test_import.py:

import random
import itertools

I ran the following commands:

$ strace -Tf python test_import.py >& strace.out
$ grep '/usr/lib64/python2.6/' strace.out | wc
331    3160   35350

So it's looking in that directory 331 times, many of them with results like:

stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>

If it cached the directory it wouldn't have to stat the file to see if it's there.
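To see how many of those lookups are misses rather than hits, the strace output can be tallied by system call and outcome. The following is only a rough sketch: the function name `tally_lookups` and the regular expression are my own, and the pattern assumes lines shaped like the `stat(...)` example above, optionally preceded by a PID as `strace -f` prints it.

```python
import re
from collections import Counter

# Matches lines such as:
#   stat("/some/path", 0x7fff...) = -1 ENOENT (No such file or directory) <0.000009>
# An optional leading PID ("[pid 123] " or "123 ") is tolerated.
LINE_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'       # optional PID prefix
    r'(?P<call>\w+)\("(?P<path>[^"]*)"'     # syscall name and first path argument
    r'.*?=\s+(?P<ret>-?\d+)'                # return value
    r'(?:\s+(?P<errno>E[A-Z]+))?'           # errno name, present only on failure
)


def tally_lookups(lines, prefix):
    """Count (syscall, outcome) pairs for paths that start with `prefix`."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or not m.group('path').startswith(prefix):
            continue
        outcome = m.group('errno') or 'ok'
        counts[(m.group('call'), outcome)] += 1
    return counts
```

Calling `tally_lookups(open('strace.out'), '/usr/lib64/python2.6/')` would then give one count per combination such as `('stat', 'ENOENT')` or `('open', 'ok')`.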

Paul Hildebrandt asked Aug 12 '14 02:08



2 Answers

You can avoid this by either moving to Python 3.3, or replacing the standard import system with an alternative. In the strace talk that I gave two weeks ago at PyOhio, I discuss the unfortunate O(nm) performance (for n directories and m possible suffixes) of the old import mechanism; start at this slide.
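To make the O(nm) cost concrete, here is a small sketch of what a naive importer has to probe: every suffix in every directory until something exists. The suffix list is an illustrative assumption (the real list depends on the Python build), and both function names are hypothetical.

```python
import os

# Assumed, illustrative suffix list; the real Python 2 list depends on the build.
SUFFIXES = ['', '.so', 'module.so', '.py', '.pyc']


def candidate_paths(module_name, directories, suffixes=SUFFIXES):
    """Yield every path a naive importer would probe for `module_name`."""
    for directory in directories:       # n directories
        for suffix in suffixes:         # m suffixes in each one
            yield os.path.join(directory, module_name + suffix)


def probes_before_hit(module_name, directories, existing, suffixes=SUFFIXES):
    """Count probes up to the first path present in `existing` (a set of paths)."""
    candidates = candidate_paths(module_name, directories, suffixes)
    for count, candidate in enumerate(candidates, 1):
        if candidate in existing:
            return count
    return None
```

With three directories and five suffixes, a module that lives in the last directory costs up to 15 probes, and each failed probe is a round trip when the disk is remote.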

I demonstrate how easy_install plus a Zope-powered web framework generates 73,477 system calls simply to do enough imports to get up and running.

After a quick install of bottle in a virtualenv on my laptop, for example, I find that 1,994 open and stat64 calls, 1,718 of them failures, are necessary for Python to import that module and be up and running:

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179           0      1519      1355 open
  0.00    0.000000           0       475       363 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179                  1994      1718 total

If I hop into os.py, however, I can add a caching importer and even with a very naive implementation can cut the number of misses down by nearly a thousand:

$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041           0       699       581 open
  0.00    0.000000           0       301       189 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041                  1000       770 total

I chose os.py for the experiment because strace shows it to be the very first module that Python imports, and the sooner we can get our importer installed, the fewer Standard Library modules Python will have to import under its old terrible slow regime!

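# Put this right below "del _names" in os.py (Python 2 only).
# It relies on names that os.py has already bound at that point:
# sys, listdir and path.

class CachingImporter(object):
    """Meta-path finder that lists each sys.path directory only once."""

    def __init__(self):
        # Maps a sys.path entry to the set of file names found in it.
        self.directory_listings = {}

    def find_module(self, fullname, other_path=None):
        filename = fullname + '.py'
        for syspath in sys.path:
            listing = self.directory_listings.get(syspath, None)
            if listing is None:
                try:
                    listing = set(listdir(syspath))
                except OSError:
                    # Missing or unreadable directory: treat it as empty.
                    listing = set()
                self.directory_listings[syspath] = listing
            if filename in listing:
                modpath = path.join(syspath, filename)
                return CachingLoader(modpath)
        # No match: return None so the normal import mechanism takes over.
        return None

class CachingLoader(object):
    """Loader that executes the source file located by CachingImporter."""

    def __init__(self, modpath):
        self.modpath = modpath

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        import imp
        mod = imp.new_module(fullname)
        mod.__loader__ = self
        mod.__file__ = self.modpath
        # Register before executing so circular imports can see the module.
        sys.modules[fullname] = mod
        # file(), not open(): inside os.py the name "open" is the
        # low-level os.open, not the builtin.
        with file(self.modpath) as f:
            code = f.read()
        exec code in mod.__dict__
        return mod

sys.meta_path.append(CachingImporter())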

This has rough edges, of course: it does not try to detect .pyc files or .so files or any of the other extensions that Python might go looking for. Nor does it know about __init__.py files or about modules inside of packages (which would require running listdir() in sub-directories of the sys.path entries). But it at least illustrates that thousands of extra calls can be eliminated through something like this, and demonstrates a direction you might try out. When it cannot find a module, the normal import mechanism simply kicks in instead.
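As a rough illustration of how those gaps might be narrowed, the sketch below caches directory listings while also checking several suffixes and recognizing packages by their __init__.py. It is written for Python 3 so that it runs standalone, it is a lookup helper rather than a full importer, and the class name `ListingCache` and its default suffix tuple are assumptions of mine.

```python
import os


class ListingCache:
    """Remember each directory's contents after the first os.listdir() call."""

    def __init__(self, suffixes=('.py', '.pyc', '.so')):
        self.suffixes = suffixes
        self.listings = {}       # directory -> frozenset of entry names
        self.listdir_calls = 0   # how many real directory reads happened

    def listing(self, directory):
        if directory not in self.listings:
            self.listdir_calls += 1
            try:
                names = os.listdir(directory)
            except OSError:
                names = []       # missing, unreadable, or not a directory
            self.listings[directory] = frozenset(names)
        return self.listings[directory]

    def find(self, name, search_path):
        """Return the path of module or package `name`, or None."""
        for directory in search_path:
            entries = self.listing(directory)
            # A package is a sub-directory that contains __init__.py.
            if name in entries:
                package_dir = os.path.join(directory, name)
                if '__init__.py' in self.listing(package_dir):
                    return os.path.join(package_dir, '__init__.py')
            for suffix in self.suffixes:
                if name + suffix in entries:
                    return os.path.join(directory, name + suffix)
        return None
```

The cache is never invalidated, so files created after a directory was first listed stay invisible; that trade-off is acceptable only when the search path does not change while the program runs.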

I wonder if there is a good caching importer already available on PyPI or somewhere? It seems like the sort of thing that would have been written a hundred times already in various shops. I thought that Noah Gift had written one and put it in a blog post or something, but I cannot find a link that confirms that memory of mine.

Edit: as @ncoglan mentions in the comments, there is an alpha-release backport of the new Python 3.3+ import system to Python 2.7 available on PyPI: http://pypi.python.org/pypi/importlib2 — unfortunately it looks like the questioner is still stuck on 2.6.
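For readers who can move to Python 3.3 or later, the caching is visible from inside the interpreter: each directory on sys.path gets a finder object stored in `sys.path_importer_cache`, and the documented `importlib.machinery.FileFinder` caches directory contents. The sketch below only demonstrates that such a finder is created and reused; the module name `cached_lookup_demo` is hypothetical.

```python
import importlib
import importlib.machinery
import os
import sys
import tempfile

# Hypothetical module name, used only for this demonstration.
MODULE_NAME = 'cached_lookup_demo'

with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, MODULE_NAME + '.py'), 'w') as f:
        f.write('VALUE = 42\n')

    sys.path.insert(0, tmpdir)
    try:
        # Ask existing finders to drop stale listings before importing.
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        # The per-directory finder created for tmpdir during the import.
        finder = sys.path_importer_cache.get(tmpdir)
    finally:
        sys.path.remove(tmpdir)

finder_is_file_finder = isinstance(finder, importlib.machinery.FileFinder)
```

Note that `importlib.invalidate_caches()` is the documented way to make the finders notice files created while the program is running.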

Brandon Rhodes answered Sep 21 '22 12:09


I know that this is not exactly what you are looking for but I'll answer anyway :D

There is no cache system for sys.path directories, but zipimport builds an index of the modules inside a .zip file and uses it to make module lookup faster.

The drawback of this solution is that you cannot use it with binary modules (e.g. .so files), because dlopen(), which Python uses to load this kind of module, cannot load them from inside an archive.

Another problem is that some modules (like the posixpath in your example) are loaded by the CPython interpreter during its bootstrap process.
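A minimal sketch of the zipimport approach, assuming pure-Python modules only: bundle the source into an archive and put the archive itself on sys.path. It is written for Python 3, and the module name `zipped_demo`, its contents, and the archive name `bundle.zip` are all made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Hypothetical module name and contents, used only for this demonstration.
MODULE_NAME = 'zipped_demo'
SOURCE = 'def greet():\n    return "hello from the zip"\n'

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'bundle.zip')

# Write a single pure-Python module into the archive.
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr(MODULE_NAME + '.py', SOURCE)

# An archive path on sys.path is handled by zipimport.
sys.path.insert(0, archive)
try:
    importlib.invalidate_caches()
    module = importlib.import_module(MODULE_NAME)
finally:
    sys.path.remove(archive)

greeting = module.greet()
loaded_from = module.__file__   # a path inside the archive
```

Every lookup against the archive is answered from the zip's index, so a miss costs no extra file system call for that sys.path entry.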

PS. I hope you remember me at PythonBrasil when I helped you stuff some bags with Disney/Pixar souvenirs :D

osantana answered Sep 19 '22 12:09