I have a framework composed of different tools, written in Python, running in a multi-user environment.
The first time I log in to the system and start one command, it takes 6 seconds just to show a few lines of help. If I immediately issue the same command again, it takes 0.1 s. After a couple of minutes it goes back to 6 s (evidence of a short-lived cache).
The system sits on GPFS, so disk throughput should be OK, though access might be slow because of the sheer number of files in the system.
strace -e open python tool 2>&1 | wc -l
shows 2154 files being accessed when starting the tool.
strace -e open python tool 2>&1 | grep ENOENT | wc -l
shows 1945 missing files being looked for (a very bad hit/miss ratio, if you ask me :-)).
My hunch is that the excessive startup time is spent querying GPFS about all those files, and that the results are cached for the next call (at either the system or the GPFS level), though I don't know how to test or prove it. I have no root access to the system, and I can only write to GPFS and /tmp.
Is it possible to cut down on this Python quest for missing files?
Any idea on how to test this in a simple way? (Reinstalling everything in /tmp is not simple, as there are many packages involved; virtualenv will not help either, I think, since it just symlinks to the files on the GPFS system.)
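The simplest check I can think of is timing the heavy imports one by one, cold versus warm; here is a rough sketch (numpy and scipy are just placeholders for whatever the tool actually imports):
# time_imports.py -- time each heavy import separately; run it right
# after logging in (cold cache) and again immediately afterwards (warm
# cache).  If the cold run accounts for most of the 6 seconds, the time
# really is going into module loading and the file lookups strace shows.
import time

for name in ("numpy", "scipy"):  # placeholders: substitute the tool's real imports
    start = time.time()
    __import__(name)
    print "%-10s %.3f s" % (name, time.time() - start)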
Of course, one option would be a daemon that forks, but that's far from "simple" and would be a last-resort solution.
Thanks for reading.
How about using the imp module? In particular, there is the function imp.find_module(name, path): http://docs.python.org/2.7/library/imp.html
At least this example (see below) reduces the number of open() syscalls compared to a plain 'import numpy, scipy' (update: but it doesn't look like it is possible to achieve significant reductions in syscalls this way...):
import imp
import sys

def loadm(name, path):
    fp, pathname, description = imp.find_module(name, [path])
    try:
        _module = imp.load_module(name, fp, pathname, description)
        return _module
    finally:
        # Since we may exit via an exception, close fp explicitly.
        if fp:
            fp.close()

numpy = loadm("numpy", "/home/username/py-virtual27/lib/python2.7/site-packages/")
scipy = loadm("scipy", "/home/username/py-virtual27/lib/python2.7/site-packages/")
I guess you'd also better check that your PYTHONPATH is empty or short, because every extra entry there adds more directories to search on each import and so increases the loading time.
Python 2 looks for modules relative to the current package first. If your library code has a lot of imports of a lot of top-level modules, those are all looked up as relative imports first. So, if code in package foo.bar does import os, Python first looks for foo/bar/os.py. That miss is cached by Python itself too.
In Python 3, the default has moved to absolute imports instead; you can switch Python 2.5 and up to use absolute imports per module with:
from __future__ import absolute_import
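As a sketch, here is what that looks like for a module inside the hypothetical foo.bar package from the example above; with the __future__ line in place, the relative probing is skipped:
# foo/bar/__init__.py -- hypothetical package module.
from __future__ import absolute_import

# With absolute_import in effect, these imports go straight to the
# standard library; without it, Python 2 first probes for foo/bar/os.py
# and foo/bar/sys.py, producing two of the ENOENT misses seen in strace.
import os
import sys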
Another source of file-lookup misses is loading .pyc bytecode cache files; if those are missing for some reason (e.g. the filesystem is not writable by the current Python process), then Python will keep looking for them on every run. You can create these caches with the compileall module:
python -m compileall /path/to/directory/with/pythoncode
provided you run that with the correct write permissions.
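If you first want to verify that missing .pyc files are part of the problem, a quick walk over the install directory will tell you. A rough sketch (the site-packages path is the one from the earlier example; adjust it to your setup):
# check_pyc.py -- list .py files that have no .pyc next to them; each one
# makes Python look for (and, if the directory is not writable, fail to
# create) the bytecode cache on every run.
import os

root = "/home/username/py-virtual27/lib/python2.7/site-packages/"  # adjust
missing = []
for dirpath, dirnames, filenames in os.walk(root):
    names = set(filenames)
    for fn in filenames:
        if fn.endswith(".py") and fn + "c" not in names:
            missing.append(os.path.join(dirpath, fn))

print "%d modules without a .pyc" % len(missing)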