We've been doing a lot of benchmarking of Python running over a remote connection. The program runs offsite but accesses disks on-site. We are running under RHEL6. We watched a simple program with strace, and it appears to spend a lot of time performing stat and open on files to see if they are there. Over a remote connection that is costly. Is there a way to configure Python to read a directory's contents once and cache its listing so it doesn't have to check again?
Sample Program test_import.py:
import random
import itertools
I ran the following commands:
$ strace -Tf python test_import.py >& strace.out
$ grep '/usr/lib64/python2.6/' strace.out | wc
331 3160 35350
So it's looking in that directory roughly 331 times, and a lot of those calls come back with results like:
stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>
If it cached the directory it wouldn't have to stat the file to see if it's there.
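To put a number on how many of those lookups are misses, the strace output can be tallied with a few lines of Python. This is only a sketch written for Python 3: the helper name `count_misses` is made up for illustration, the regular expression assumes the one-syscall-per-line format that `strace -T` prints, and every sample line except the first is invented.

```python
import re
from collections import Counter

# Captures: syscall name, first quoted argument, numeric result, optional errno name.
LINE_RE = re.compile(r'(\w+)\("([^"]*)".*\)\s*=\s*(-?\d+)(?:\s+([A-Z]+))?')


def count_misses(lines, prefix):
    """Return (total, misses) Counters keyed by syscall for paths under prefix."""
    total, misses = Counter(), Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        syscall, path, _result, errno_name = match.groups()
        if not path.startswith(prefix):
            continue
        total[syscall] += 1
        if errno_name == "ENOENT":
            misses[syscall] += 1
    return total, misses


# The first line is from the question; the others are hypothetical.
sample = [
    'stat("/usr/lib64/python2.6/posixpath", 0x7fff1b447340) = -1 ENOENT (No such file or directory) <0.000009>',
    'open("/usr/lib64/python2.6/posixpath.so", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000008>',
    'open("/usr/lib64/python2.6/posixpath.py", O_RDONLY) = 3 <0.000010>',
    'open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000007>',
]
total, misses = count_misses(sample, "/usr/lib64/python2.6/")
```

On a real run, the list would be replaced by the lines of strace.out, for example `count_misses(open("strace.out"), "/usr/lib64/python2.6/")`.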
sys.path is a variable in the sys module. It contains the list of directories that the interpreter searches for a requested module. When a module (a Python file) is imported, the interpreter first looks for it among its built-in modules and then in the directories listed in sys.path.
Python caches all imported modules: every module that is imported is stored in a dictionary called sys.modules.
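A short sketch of that cache in action, written for Python 3 and using `itertools` from the sample program. Note that this cache only helps when the same name is imported again; it does nothing about the file probes made the first time a module is looked up, which is what the question is about.

```python
import importlib
import sys

import itertools  # first import: Python has to locate and load the module

# The loaded module object is now registered under its name.
registered = sys.modules["itertools"]

# Importing the same name again is a dictionary lookup that returns the same object.
again = importlib.import_module("itertools")
same_object = again is itertools and registered is itertools
```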
PYTHONPATH is closely related to sys.path. It is an environment variable that you set before running the Python interpreter; if it exists, it should contain the directories to search for modules when using import.
The os.path.isdir() method checks whether the specified path is an existing directory. It follows symbolic links, so if the path is a symbolic link pointing to a directory, the method returns True. Parameter: path, a path-like object representing a file system path.
You can avoid this by either moving to Python 3.3, or replacing the standard import system with an alternative. In the strace talk that I gave two weeks ago at PyOhio, I discuss the unfortunate O(nm) performance (for n directories and m possible suffixes) of the old import mechanism; start at this slide.
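To make the O(nm) cost concrete, here is a rough sketch that enumerates the paths an uncached importer would probe for a single module. The suffix list is an assumption modelled on the kind of probes that show up in strace output for Python 2, and the first two directories are hypothetical; it is not a reproduction of the interpreter's actual logic.

```python
import os

# Assumed, illustrative suffixes: a bare name (package directory check),
# two extension-module spellings, source, and bytecode.
SUFFIXES = ["", ".so", "module.so", ".py", ".pyc"]


def candidate_paths(module_name, search_dirs, suffixes=SUFFIXES):
    """Yield every path that might be probed, in search order."""
    for directory in search_dirs:      # n directories
        for suffix in suffixes:        # m suffixes per directory
            yield os.path.join(directory, module_name + suffix)


# "/site-a" and "/site-b" are made-up sys.path entries.
dirs = ["/site-a", "/site-b", "/usr/lib64/python2.6"]
probes = list(candidate_paths("posixpath", dirs))
probe_count = len(probes)  # n * m when the module lives in the last directory
```

Each entry in `probes` corresponds to one stat or open call, so a long sys.path multiplies the cost of every import that is not found early.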
I demonstrate how easy_install plus a Zope-powered web framework generates 73,477 system calls simply to do enough imports to get up and running.
After a quick install of bottle in a virtualenv on my laptop, for example, I find that nearly 2,000 calls (1,994 in the summary below) are necessary for Python to import that module and be up and running:
$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179           0      1519      1355 open
  0.00    0.000000           0       475       363 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000179                  1994      1718 total
If I hop into os.py, however, I can add a caching importer and even with a very naive implementation can cut the number of misses down by nearly a thousand:
$ strace -c -e stat64,open python -c 'import bottle'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041           0       699       581 open
  0.00    0.000000           0       301       189 stat64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000041                  1000       770 total
I chose os.py for the experiment because strace shows it to be the very first module that Python imports, and the sooner we can get our importer installed, the fewer Standard Library modules Python will have to import under its old terrible slow regime!
# Put this right below "del _names" in os.py
# (Python 2 code; it relies on the names sys, listdir and path
# that os.py already has in scope at that point.)

class CachingImporter(object):
    """Meta-path finder that lists each sys.path directory only once."""

    def __init__(self):
        self.directory_listings = {}

    def find_module(self, fullname, other_path=None):
        filename = fullname + '.py'
        for syspath in sys.path:
            listing = self.directory_listings.get(syspath, None)
            if listing is None:
                try:
                    listing = listdir(syspath)
                except OSError:
                    listing = []
                self.directory_listings[syspath] = listing
            if filename in listing:
                modpath = path.join(syspath, filename)
                return CachingLoader(modpath)
        # Falling through returns None, so the normal import mechanism runs.

class CachingLoader(object):
    """Loader that executes the source file found by CachingImporter."""

    def __init__(self, modpath):
        self.modpath = modpath

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        import imp
        mod = imp.new_module(fullname)
        mod.__loader__ = self
        # Register the module before running its code, as the import protocol expects.
        sys.modules[fullname] = mod
        mod.__file__ = self.modpath
        with file(self.modpath) as f:
            code = f.read()
        exec code in mod.__dict__
        return mod

sys.meta_path.append(CachingImporter())
This has rough edges, of course: it does not try to detect .pyc files or .so files or any of the other extensions that Python might go looking for. Nor does it know about __init__.py files or about modules inside of packages (which would require running listdir() in sub-directories of the sys.path entries). But it at least illustrates that thousands of extra calls can be eliminated through something like this, and demonstrates a direction you might try out. When it cannot find a module, the normal import mechanism simply kicks in instead.
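As a rough idea of how two of those rough edges might be smoothed, the sketch below looks up several suffixes and top-level packages against cached directory listings. It is written for Python 3 and is deliberately separate from the import hooks: `ListingCache` and `locate` are made-up names, the suffix tuple is an assumption, and dotted names such as `pkg.mod` are not handled.

```python
import os


class ListingCache(object):
    """Remember os.listdir() results so each directory is read only once."""

    def __init__(self):
        self._listings = {}

    def listing(self, directory):
        if directory not in self._listings:
            try:
                names = frozenset(os.listdir(directory))
            except OSError:
                # Missing directory, or the path is not a directory at all.
                names = frozenset()
            self._listings[directory] = names
        return self._listings[directory]


def locate(name, search_dirs, cache, suffixes=(".py", ".pyc", ".so")):
    """Return the path of the first package or module file for name, or None."""
    for directory in search_dirs:
        names = cache.listing(directory)
        # A package is a sub-directory that contains an __init__.py file.
        if name in names:
            package_dir = os.path.join(directory, name)
            if "__init__.py" in cache.listing(package_dir):
                return os.path.join(package_dir, "__init__.py")
        for suffix in suffixes:
            if name + suffix in names:
                return os.path.join(directory, name + suffix)
    return None
```

The trade-off is staleness: a file created after a directory has been listed stays invisible until a fresh `ListingCache` is used, which is usually acceptable for a read-only installation but not for code that generates modules at runtime.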
I wonder if there is a good caching importer already available on PyPI or somewhere? It seems like the sort of thing that would have been written a hundred times already in various shops. I thought that Noah Gift had written one and put it in a blog post or something, but I cannot find a link that confirms that memory of mine.
Edit: as @ncoglan mentions in the comments, there is an alpha-release backport of the new Python 3.3+ import system to Python 2.7 available on PyPI: http://pypi.python.org/pypi/importlib2 — unfortunately it looks like the questioner is still stuck on 2.6.
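For readers who can run Python 3.3 or later, the directory caching of the new import system can be observed directly. The sketch below assumes Python 3 and uses throwaway module names (`cached_demo_one`, `cached_demo_two`) in a temporary directory; it shows the per-directory finder kept in `sys.path_importer_cache` and the call to `importlib.invalidate_caches()` that is advisable when modules are created while the program is running.

```python
import importlib
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "cached_demo_one.py"), "w") as f:
    f.write("VALUE = 1\n")

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
one = importlib.import_module("cached_demo_one")

# After the first import, the finder for this directory is kept and reused.
finder = sys.path_importer_cache[tmpdir]

# A module written later may not be visible in the cached directory listing,
# so the caches are invalidated before importing it.
with open(os.path.join(tmpdir, "cached_demo_two.py"), "w") as f:
    f.write("VALUE = 2\n")
importlib.invalidate_caches()
two = importlib.import_module("cached_demo_two")
```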
I know that this is not exactly what you are looking for but I'll answer anyway :D
There is no cache system for sys.path directories but zipimport creates an index of the modules inside of the .zip file. This index is used to make module lookup faster.
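A minimal sketch of the zipimport approach, written for Python 3 (on Python 2.6 the same idea applies, but `importlib` is not available there). The archive name `bundle.zip` and the module `zipped_demo` are invented for the example; placing the archive on sys.path is what lets the import system use it.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small archive containing one pure-Python module.
archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zipped_demo.py", "GREETING = 'hello from the zip'\n")

# Putting the .zip file itself on sys.path makes its contents importable.
sys.path.insert(0, archive)
importlib.invalidate_caches()
zipped_demo = importlib.import_module("zipped_demo")
```

Because lookups are served from the archive's index, a missing module costs no per-suffix stat or open calls against that sys.path entry.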
The drawback of this solution is that you cannot use it with binary modules (e.g. .so) due to the lack of support in dlopen(), which Python uses to load this kind of module.
Another problem is that some modules (like the posixpath used in your example) are loaded by the CPython interpreter during its bootstrap process.
PS. I hope you remember me at PythonBrasil when I helped you stuff some bags with Disney/Pixar souvenirs :D