
global vs. local namespace performance difference

Why is it that executing a set of commands in a function:

def main():
    [do stuff]
    return something
print(main())

will tend to run 1.5x to 3x faster in Python than executing the same commands at the top level:

[do stuff]
print(something)
asked Jan 11 '16 03:01 by user119264


People also ask

What is the difference between global and local variable?

A global variable is a variable that is accessible globally. A local variable is one that is only accessible to the current scope, such as temporary variables used in a single function definition.

Does Python compute local variables quicker?

Locals should be faster. When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order, starting with the local namespace - specific to the current function or class method.

Is it good to use global variables Python?

While in many or most other programming languages variables are treated as global if not declared otherwise, Python deals with variables the other way around. They are local, if not otherwise declared. The driving reason behind this approach is that global variables are generally bad practice and should be avoided.

What is the difference between nonlocal and global in Python?

An important difference between nonlocal and global is that a nonlocal variable must already be bound in the enclosing namespace (otherwise a SyntaxError will be raised), while a global declaration in a local scope does not require the variable to be pre-bound (it will create a new binding in the global ...


1 Answer

The difference does indeed greatly depend on what "do stuff" actually does, and mainly on how many times it accesses names that are defined/used. Granted that the code is otherwise similar, there is a fundamental difference between these two cases:

  • In functions, names are loaded/stored with the LOAD_FAST/STORE_FAST byte code instructions.
  • In the top level scope (i.e. the module), the same commands are performed with LOAD_NAME/STORE_NAME, which are more sluggish.

This can be seen in the following cases. I'll use a for loop to make sure that lookups for the defined variables are performed multiple times.

Function and LOAD_FAST/STORE_FAST:

We define a simple function that does some really silly things:

def main():
    b = 20
    for i in range(1000000): z = 10 * b 
    return z

Output generated by dis.dis:

dis.dis(main)
# [/snipped output/]

             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_FAST               1 (i)
             25 LOAD_CONST               3 (10)
             28 LOAD_FAST                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_FAST               2 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK

# [/snipped output/]

The thing to note here is the LOAD_FAST/STORE_FAST commands at offsets 28 and 32; these access the b name used in the BINARY_MULTIPLY operation and store the z name, respectively. As their names imply, they are the fast versions of the LOAD_*/STORE_* family.
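You can confirm this yourself without reading the full disassembly; a minimal sketch using `dis.get_instructions` (available since Python 3.4) to collect the opcode names emitted for the function body:

```python
import dis

def main():
    b = 20
    for i in range(1000000):
        z = 10 * b
    return z

# Collect the set of opcode names the compiler emitted for the function.
ops = {instr.opname for instr in dis.get_instructions(main)}

print("LOAD_FAST" in ops)   # local reads use the fast path
print("STORE_FAST" in ops)  # local writes too
print("LOAD_NAME" in ops)   # never appears inside a function body
```

Exact offsets and surrounding opcodes vary between Python versions, but the `*_FAST` instructions for local variables are consistent across them.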


Modules and LOAD_NAME/STORE_NAME:

Now, let's look at the output of dis for our module version of the previous function:

# compile the module
m = compile(open('main.py', 'r').read(), "main", "exec")

dis.dis(m)
# [/snipped output/]

             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_NAME               2 (i)
             25 LOAD_CONST               3 (10)
             28 LOAD_NAME                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_NAME               3 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK

# [/snipped output/]

Here we have multiple calls to LOAD_NAME/STORE_NAME which, as mentioned previously, are more sluggish commands to execute.
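The same opcode check can be run against module-level code; as a stand-in for reading main.py, the sketch below compiles an equivalent source string in 'exec' mode:

```python
import dis

# Module-level equivalent of the function body above.
src = "b = 20\nfor i in range(1000):\n    z = 10 * b\n"
code = compile(src, "<module>", "exec")

ops = {instr.opname for instr in dis.get_instructions(code)}

print("STORE_NAME" in ops)  # module-level writes go through the namespace dict
print("LOAD_NAME" in ops)   # so do reads (b, range)
print("STORE_FAST" in ops)  # no fast locals exist at module level
```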

In this case, there is going to be a clear difference in execution time, mainly because Python must evaluate these instructions many times (due to the for loop I added), so the overhead introduced each time a byte code instruction is executed accumulates.

Timing the execution 'as a module':

start_time = time.time()
b = 20 
for i in range(1000000): z = 10 * b
print(z)
print("Time: ", time.time() - start_time)
200
Time:  0.15162253379821777

Timing the execution as a function:

start_time = time.time()
print(main())
print("Time: ", time.time() - start_time)
200
Time:  0.08665871620178223 

If you time loops over a smaller range (for example for i in range(1000)) you'll notice that the 'module' version is faster. This happens because the overhead of calling the function main() is larger than the overhead introduced by the *_FAST vs *_NAME difference. So the difference is largely relative to the amount of work done.
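The comparison above can be reproduced in one self-contained script. Note that module-style code must be run via exec of code compiled in 'exec' mode to keep the LOAD_NAME/STORE_NAME instructions (wrapping it in a helper function would silently switch it to the fast path); exact numbers will vary by machine:

```python
import time

# Module-style code: compiled in 'exec' mode, so it uses LOAD_NAME/STORE_NAME.
loop_src = "b = 20\nfor i in range(1000000):\n    z = 10 * b\n"
module_code = compile(loop_src, "<module>", "exec")

def main():
    # Same work, but as a function: LOAD_FAST/STORE_FAST.
    b = 20
    for i in range(1000000):
        z = 10 * b
    return z

start = time.perf_counter()
exec(module_code, {})
module_time = time.perf_counter() - start

start = time.perf_counter()
main()
function_time = time.perf_counter() - start

print(f"module:   {module_time:.4f}s")
print(f"function: {function_time:.4f}s")
```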

So, the real culprit here, and the reason why this difference is so evident, is the for loop. You generally have no reason to ever put an intensive loop like that one at the top level of your script. Move it into a function and avoid using global variables; functions are designed to be more efficient.
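In practice that means the standard main() idiom: put the heavy work in a function and call it from a guard at the bottom of the script. A minimal sketch (the loop body and constants are just placeholders):

```python
import time

def main():
    # All heavy work runs with fast, array-indexed local lookups.
    b = 20
    total = 0
    for i in range(1000000):
        total += 10 * b
    return total

if __name__ == "__main__":
    start = time.perf_counter()
    result = main()
    print(result, f"{time.perf_counter() - start:.4f}s")
```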


You can take a look at the C code executed for each byte code instruction. I'll link the source for the 3.5 version of Python here, even though I'm pretty sure 2.7 doesn't differ much. Bytecode evaluation is done in Python/ceval.c, specifically in the function PyEval_EvalFrameEx:

  • LOAD_FAST source - STORE_FAST source
  • LOAD_NAME source - STORE_NAME source

As you'll see, the *_FAST bytecodes simply get the value stored/loaded from fastlocals, an array-based local symbol table contained inside frame objects that is indexed directly (the slots are assigned at compile time), whereas the *_NAME bytecodes must perform dictionary lookups.
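This split is visible from Python as well: a code object's co_varnames tuple lists the fast-local slots assigned at compile time, while co_names lists the names resolved by dictionary lookup. A quick check:

```python
def main():
    b = 20
    z = 10 * b
    return z

# A function's locals get compile-time slots in co_varnames...
print(main.__code__.co_varnames)   # ('b', 'z')

# ...while module-level code has no fast locals at all: its names
# go in co_names and are resolved through the namespace dict.
module_code = compile("b = 20\nz = 10 * b\n", "<module>", "exec")
print(module_code.co_varnames)     # ()
print("b" in module_code.co_names and "z" in module_code.co_names)  # True
```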

answered Oct 19 '22 22:10 by Dimitris Fasarakis Hilliard