 

Does a large and expansive PYTHONPATH affect performance?

Let's say you have a project with several levels of nested folders, and, to make import statements cleaner, people have amended the PYTHONPATH for the whole project in various places.

This means that instead of saying:

from folder1.folder2.folder3 import foo

they can now say:

from folder3 import foo

after adding folder1/folder2 to the PYTHONPATH. The question is: if you keep this up and end up with a large number of paths in PYTHONPATH, does that incur an appreciable performance hit?

To give a sense of scale, I'm asking about costs on the order of milliseconds at a minimum (i.e. 100 ms? 500 ms?).
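For concreteness, a rough (hypothetical) harness like the one below is what I have in mind for measuring this; the paths are made up, and results will obviously vary by machine and interpreter:

import os
import subprocess
import sys
import time

def startup_ms(n_entries):
    """Time a bare interpreter start-up with n_entries PYTHONPATH entries."""
    env = dict(os.environ)
    # Fake entries; real, existing directories would cost extra
    # stat/open calls on top of what non-existent ones do.
    env["PYTHONPATH"] = os.pathsep.join("/tmp/pp%d" % i for i in range(n_entries))
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)
    return (time.perf_counter() - start) * 1000

for n in (0, 10, 100):
    print("%3d entries: %.1f ms" % (n, startup_ms(n)))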

asked Jan 27 '23 by apanzerj

1 Answer

The performance trade-off between having a lot of different directories in your PYTHONPATH and having deeply-nested package structures shows up in the system calls. Assuming we have the following directory structures:

bash-3.2$ tree a
a
└── b
    └── c
        └── d
            └── __init__.py
bash-3.2$ tree e
e
├── __init__.py
├── __init__.pyc
└── f
    ├── __init__.py
    ├── __init__.pyc
    └── g
        ├── __init__.py
        ├── __init__.pyc
        └── h
            ├── __init__.py
            └── __init__.pyc

We can use these structures and the strace program to compare and contrast the system calls that we generate for the following commands:

strace python -c 'from e.f.g import h'
PYTHONPATH="./a/b/c:$PYTHONPATH" strace python -c 'import d'

Many PYTHONPATH Entries

So the trade-off here is really system calls at start-up time versus system calls at import time. For each entry in PYTHONPATH, Python first checks whether the directory exists:

stat("./a/b/c", {st_mode=S_IFDIR|0776, st_size=4096, ...}) = 0
stat("./a/b/c", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0

If the directory exists (it does, as indicated by the 0 return value on the right), Python searches it for a number of start-up modules when the interpreter launches. For each module it checks:

stat("./a/b/c/site", 0x7ffd900baaf0)    = -1 ENOENT (No such file or directory)
open("./a/b/c/site.x86_64-linux-gnu.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("./a/b/c/site.so", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("./a/b/c/sitemodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("./a/b/c/site.py", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("./a/b/c/site.pyc", O_RDONLY)      = -1 ENOENT (No such file or directory)

Each of these fails, and Python moves on to the next entry in the path in search of the module. My Python 3.5 interpreter looked up 25 modules this way, producing an incremental 152 system calls at start-up per new PYTHONPATH entry.
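As an aside, on Python 3.7 and later (newer than the 3.5 interpreter used here) you can get per-import timings directly from the interpreter, which is an easier way to put wall-clock numbers on this than counting strace lines:

python -X importtime -c 'from e.f.g import h'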

Deep Package Structure

The deep package structure pays no penalty on interpreter start-up, but when we import from the deeply nested package structure we do see a difference. As a baseline, here is the simple import of d/__init__.py from the a/b/c directory in our PYTHONPATH:

stat("/home/matt/a/b/c/d", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("/home/matt/a/b/c/d/__init__.py", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
stat("/home/matt/a/b/c/d/__init__", 0x7ffd900ba990) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.x86_64-linux-gnu.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__module.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.py", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
open("/home/matt/a/b/c/d/__init__.pyc", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0664, st_size=117, ...}) = 0
read(4, "\3\363\r\n\17\3105[c\0\0\0\0\0\0\0\0\1\0\0\0@\0\0\0s\4\0\0\0d\0"..., 4096) = 117
fstat(4, {st_mode=S_IFREG|0664, st_size=117, ...}) = 0
read(4, "", 4096)                       = 0
close(4)                                = 0
close(3)                                = 0

Basically what this is doing is looking for the d package or module. When it finds d/__init__.py it opens it, and then opens d/__init__.pyc and reads the contents into memory before closing both files.
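If you want to see where this search lands without reading a trace, here is a small sketch using the standard-library importlib (assuming a/b/c is on the path as above):

import importlib.util

# find_spec runs the same finder machinery without executing the module
spec = importlib.util.find_spec("d")
print(spec.origin)                       # .../a/b/c/d/__init__.py
print(spec.submodule_search_locations)   # where d's subpackages would be found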

With our deeply nested package structure we have to repeat this operation 3 additional times, once for each extra package level, which adds 15 system calls per directory for a total of 45 more system calls. While this is less than half the number of calls added by each extra PYTHONPATH entry, the read calls could potentially be more time-consuming than other system calls (or require more system calls) depending on the size of the __init__.py files.
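The repetition happens because importing a deeply nested name imports every parent package along the way. A quick way to see this (assuming the e/f/g/h tree above is importable):

import sys
from e.f.g import h

# every parent package is now a distinct entry in sys.modules
print(sorted(m for m in sys.modules if m == "e" or m.startswith("e.")))
# ['e', 'e.f', 'e.f.g', 'e.f.g.h']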

TL;DR

Taking this all into consideration, these differences are almost certainly not material enough to outweigh the design benefits of your desired solution.

This is especially true if your processes are long-running (like a web-app) rather than being short-lived.

We can reduce the system calls by:

  1. Removing any extraneous PYTHONPATH entries
  2. Pre-compiling your .pyc files so the interpreter never has to write them (see the sketch after this list)
  3. Keeping your package structure flat
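For item 2, a minimal sketch using the standard-library compileall module (the path is the a/b/c tree from the example above):

import compileall

# Compile every .py under the tree ahead of time so imports never have to
# write bytecode themselves; quiet=1 suppresses the per-file listing.
compileall.compile_dir("a/b/c", quiet=1)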

We could reduce the system calls more drastically by removing the .py files altogether and shipping only the .pyc files, so the source is never opened alongside the bytecode; but losing the source for debugging seems like a step too far to me.

Hope this is useful; it's probably a far deeper dive than necessary.

answered Jan 30 '23 by Matthew Story