Why is it that executing a set of commands in a function:
def main():
    [do stuff]
    return something
print(main())
will tend to run 1.5x to 3x faster in Python than executing the same commands at the top level:
[do stuff]
print(something)
The difference does indeed greatly depend on what "do stuff" actually does, and mainly on how many times it accesses names that are defined/used. Granted that the code is similar, there is a fundamental difference between the two cases: in a function, names are loaded and stored with the LOAD_FAST/STORE_FAST bytecodes, while at the top level names are handled with LOAD_NAME/STORE_NAME, which are more sluggish. This can be viewed in the following cases; I'll be using a for loop to make sure that the lookups for the defined variables are performed multiple times.
Function and LOAD_FAST/STORE_FAST:
We define a simple function that does some really silly things:
def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z
Output generated by dis.dis:
import dis
dis.dis(main)
# [/snipped output/]
18 GET_ITER
>> 19 FOR_ITER 16 (to 38)
22 STORE_FAST 1 (i)
25 LOAD_CONST 3 (10)
28 LOAD_FAST 0 (b)
31 BINARY_MULTIPLY
32 STORE_FAST 2 (z)
35 JUMP_ABSOLUTE 19
>> 38 POP_BLOCK
# [/snipped output/]
The thing to note here is the LOAD_FAST/STORE_FAST commands at offsets 28 and 32; these are used to access the name b for the BINARY_MULTIPLY operation and to store the name z, respectively. As their bytecode names imply, they are the fast versions of the LOAD_*/STORE_* family.
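What makes them fast is that the argument of a *_FAST instruction is a plain index into a C array holding the function's locals, rather than a key for a dictionary lookup. As a minimal sketch (reusing the main() defined above), you can see that index-based layout by inspecting the function's code object:
import dis

def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z

# The numeric argument of each LOAD_FAST/STORE_FAST above (0 for b, 1 for i,
# 2 for z) is an index into this tuple, i.e. into the fast-locals array.
print(main.__code__.co_varnames)   # ('b', 'i', 'z')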
Modules and LOAD_NAME/STORE_NAME:
Now, let's look at the output of dis for our module version of the previous function:
# compile the module
m = compile(open('main.py', 'r').read(), "main", "exec")
dis.dis(m)
# [/snipped output/]
18 GET_ITER
>> 19 FOR_ITER 16 (to 38)
22 STORE_NAME 2 (i)
25 LOAD_CONST 3 (10)
28 LOAD_NAME 0 (b)
31 BINARY_MULTIPLY
32 STORE_NAME 3 (z)
35 JUMP_ABSOLUTE 19
>> 38 POP_BLOCK
# [/snipped output/]
Over here we have multiple calls to LOAD_NAME/STORE_NAME, which, as mentioned previously, are more sluggish commands to execute. In this case, there is going to be a clear difference in execution time, mainly because Python must evaluate LOAD_NAME/STORE_NAME and LOAD_FAST/STORE_FAST multiple times (due to the for loop I added) and, as a result, the overhead introduced each time each bytecode is executed will accumulate.
Timing the execution 'as a module':
import time

start_time = time.time()
b = 20
for i in range(1000000): z = 10 * b
print(z)
print("Time: ", time.time() - start_time)
200
Time: 0.15162253379821777
Timing the execution as a function:
start_time = time.time()
print(main())
print("Time: ", time.time() - start_time)
200
Time: 0.08665871620178223
If you time loops over a smaller range (for example, for i in range(1000)) you'll notice that the 'module' version is faster. This happens because the overhead introduced by needing to call the function main() is larger than that introduced by the *_FAST vs *_NAME differences. So it's largely relative to the amount of work that is done.
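To see that crossover without editing any files, here's a minimal sketch (the helper names are mine and the timings are illustrative; they'll vary by machine and Python version) that runs the same loop body both ways, using exec on top-level-compiled code for the *_NAME version:
import time

def time_module_style(n):
    # Compiled in "exec" mode at the top level, so b/i/z use LOAD_NAME/STORE_NAME.
    code = compile("b = 20\nfor i in range(n): z = 10 * b", "<module>", "exec")
    start = time.time()
    exec(code, {"n": n})
    return time.time() - start

def time_function_style(n):
    # The same body inside a function, so b/i/z use LOAD_FAST/STORE_FAST,
    # at the cost of one extra function call.
    def f():
        b = 20
        for i in range(n): z = 10 * b
    start = time.time()
    f()
    return time.time() - start

for n in (1000, 1000000):
    print(n, time_module_style(n), time_function_style(n))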
So, the real culprit here, and the reason why this difference is so evident, is the for loop used. You generally have no reason to ever put an intensive loop like that one at the top level of your script. Move it into a function and avoid using global variables; functions are designed to be more efficient.
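The idiomatic way to do that is the standard entry-point guard, so the heavy lifting happens inside a function scope and runs only when the file is executed as a script:
def main():
    b = 20
    for i in range(1000000):
        z = 10 * b
    return z

if __name__ == '__main__':
    print(main())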
You can take a look at the code executed for each of the bytecodes. I'll link the source for the 3.5 version of Python here, even though I'm pretty sure 2.7 doesn't differ much. Bytecode evaluation is done in Python/ceval.c, specifically in the function PyEval_EvalFrameEx:
LOAD_FAST source - STORE_FAST source
LOAD_NAME source - STORE_NAME source
As you'll see, the *_FAST bytecodes simply get the value stored/loaded using the fastlocals local symbol table contained inside frame objects.
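One observable consequence of that design (a minimal illustration, specific to CPython) is that a function's locals don't live in a real dictionary at all; locals() only hands you a snapshot of the fast-locals array, whereas a module's namespace is an ordinary dict that the *_NAME bytecodes search:
def f():
    x = 1
    locals()['x'] = 99   # mutates only a snapshot, not the fast-locals array
    return x             # LOAD_FAST still sees the original value

print(f())               # 1
print(type(globals()))   # <class 'dict'> -- this is what LOAD_NAME searches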