UPDATE
Sorry guys, but the previous number is INACCURATE. When I tested the previous code, I had wrapped the iterable in tqdm to see the expected time, and that wrapper hurts performance badly when the iterable is very long. That is how I got 18.22s, which is about 7.5 times longer than 2.43s.
HOWEVER, the conclusion is CONSISTENT: the nested loop is much FASTER. When the iteration count is 100^5, the difference is significant: 321.49s vs 210.05s, roughly a 1.53× gap. Generally we don't face iteration counts this long; I'm just curious about the reason for this anomalous situation.
My Python version is 3.7.3, on a 13-inch MacBook Pro (2019) with a 2.4 GHz Intel Core i5, running macOS Mojave 10.14.6.
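For reference, the per-iteration overhead added by wrapping an iterable in tqdm can be checked directly; this is just a minimal sketch (the 10**7 size is arbitrary, and absolute times depend on the machine):

import time
from tqdm import tqdm

n = 10**7

start = time.time()
for i in range(n):
    pass
print('bare range:', time.time() - start)

# the same loop, with tqdm doing bookkeeping on every iteration
start = time.time()
for i in tqdm(range(n)):
    pass
print('with tqdm: ', time.time() - start)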
I found a weird situation in Python:
import time

start = time.time()
# flattened loop
for i in range(100**4):
    pass
print(time.time() - start)  # 18.22 (wrong, see update: should be 3.09)

# nested loop
start = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            for l in range(100):
                pass
print(time.time() - start)  # 2.43
The two loops above have the same total number of iterations. However, the nested loop (2.43s) runs much faster than the flattened one (18.22s), and the difference grows as the iteration count increases. How does this happen?
Firstly, that is not a reliable way of measuring code execution. Let us consider this code instead (to be run in IPython), which does not include the power computation inside the loop and performs some work in the body just to make sure the loop cannot be optimized away into "please do nothing":
def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
t1 = %timeit -q -o flattened_loop(n)
t2 = %timeit -q -o nested4_loop(n)
print(f'{n:4} {n**4:16} {t1.best * 1e3:15.3f} ms {t2.best * 1e3:15.3f} ms')
n n ** 4 flattened nested4
10 10000 0.526 ms 0.653 ms
20 160000 8.561 ms 8.459 ms
30 810000 43.077 ms 39.417 ms
40 2560000 136.709 ms 121.422 ms
50 6250000 331.748 ms 291.132 ms
60 12960000 698.014 ms 599.228 ms
70 24010000 1280.681 ms 1081.062 ms
80 40960000 2187.500 ms 1826.629 ms
90 65610000 3500.463 ms 2942.909 ms
100 100000000 5349.721 ms 4437.965 ms
110 146410000 7835.733 ms 6474.588 ms
which shows that a difference does exist, and that it is larger for larger n.
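For reference, a rough equivalent of this benchmark that does not rely on IPython's %timeit magic can be written with the standard timeit module (a minimal sketch; the measurement strategy differs slightly, so the absolute numbers will not match the table above exactly):

import timeit

# assumes flattened_loop() and nested4_loop() from above are already defined
for n in range(10, 120, 10):
    m = n ** 4
    t1 = min(timeit.repeat(lambda: flattened_loop(m), number=1, repeat=3))
    t2 = min(timeit.repeat(lambda: nested4_loop(n), number=1, repeat=3))
    print(f'{n:4} {m:16} {t1 * 1e3:15.3f} ms {t2 * 1e3:15.3f} ms')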
Is the first one running more bytecode? No, as we can clearly see through dis:
flattened_loop()
import dis
dis.dis(flattened_loop)
2 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (x)
3 4 SETUP_LOOP 24 (to 30)
6 LOAD_GLOBAL 0 (range)
8 LOAD_FAST 0 (n)
10 CALL_FUNCTION 1
12 GET_ITER
>> 14 FOR_ITER 12 (to 28)
16 STORE_FAST 2 (i)
4 18 LOAD_FAST 1 (x)
20 LOAD_CONST 2 (1)
22 INPLACE_ADD
24 STORE_FAST 1 (x)
26 JUMP_ABSOLUTE 14
>> 28 POP_BLOCK
5 >> 30 LOAD_FAST 1 (x)
32 RETURN_VALUE
nested4_loop()
dis.dis(nested4_loop)
9 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (x)
10 4 SETUP_LOOP 78 (to 84)
6 LOAD_GLOBAL 0 (range)
8 LOAD_FAST 0 (n)
10 CALL_FUNCTION 1
12 GET_ITER
>> 14 FOR_ITER 66 (to 82)
16 STORE_FAST 2 (i)
11 18 SETUP_LOOP 60 (to 80)
20 LOAD_GLOBAL 0 (range)
22 LOAD_FAST 0 (n)
24 CALL_FUNCTION 1
26 GET_ITER
>> 28 FOR_ITER 48 (to 78)
30 STORE_FAST 3 (j)
12 32 SETUP_LOOP 42 (to 76)
34 LOAD_GLOBAL 0 (range)
36 LOAD_FAST 0 (n)
38 CALL_FUNCTION 1
40 GET_ITER
>> 42 FOR_ITER 30 (to 74)
44 STORE_FAST 4 (k)
13 46 SETUP_LOOP 24 (to 72)
48 LOAD_GLOBAL 0 (range)
50 LOAD_FAST 0 (n)
52 CALL_FUNCTION 1
54 GET_ITER
>> 56 FOR_ITER 12 (to 70)
58 STORE_FAST 5 (l)
14 60 LOAD_FAST 1 (x)
62 LOAD_CONST 2 (1)
64 INPLACE_ADD
66 STORE_FAST 1 (x)
68 JUMP_ABSOLUTE 56
>> 70 POP_BLOCK
>> 72 JUMP_ABSOLUTE 42
>> 74 POP_BLOCK
>> 76 JUMP_ABSOLUTE 28
>> 78 POP_BLOCK
>> 80 JUMP_ABSOLUTE 14
>> 82 POP_BLOCK
15 >> 84 LOAD_FAST 1 (x)
86 RETURN_VALUE
Note that per iteration of the innermost loop, both versions execute the same seven instructions (FOR_ITER, STORE_FAST, LOAD_FAST, LOAD_CONST, INPLACE_ADD, STORE_FAST, JUMP_ABSOLUTE); the nested version merely adds outer-loop overhead once every n iterations.
Are the numbers in the single loops getting too big? No.
import sys
print([(n, sys.getsizeof(n), n ** 4, sys.getsizeof(n ** 4)) for n in (10, 110)])
# [(10, 28, 10000, 28), (110, 28, 146410000, 28)]
Is the power operation (not timed in my code, but timed in yours) explaining the timing difference? No (it is timed only once below, because constant computations get cached in Python):
%timeit -r1 -n1 100 ** 4
# loop, best of 1: 708 ns per loop
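The caching referred to here is CPython's compile-time constant folding: the peephole optimizer evaluates 100 ** 4 once and stores the result as a literal constant in the bytecode, which dis makes visible:

import dis

# 100 ** 4 is folded into the constant 100000000 at compile time,
# so the power is never recomputed while the loop runs
dis.dis(compile('100 ** 4', '<expr>', 'eval'))
#   1           0 LOAD_CONST               0 (100000000)
#               2 RETURN_VALUE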
So, what is happening?
At this point this is just speculation but, given that we have ruled out at least some of the potential candidates, I believe that this is due to some caching mechanism taking place in the nested version. Such caching is probably in place to speed up the notoriously slow explicit Python-level looping.
If we repeat the same test with Numba compilation (where loops get lifted, i.e. executed without the boilerplate Python requires to ensure its dynamism), we actually get:
import numba as nb


@nb.jit
def flattened_loop_nb(n):
    x = 0
    for i in range(n):
        x += 1
    return x


@nb.jit
def nested4_loop_nb(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
flattened_loop_nb(100) # trigger compilation
nested4_loop_nb(100) # trigger compilation
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
m = n ** 4
t1 = %timeit -q -o flattened_loop_nb(m)
t2 = %timeit -q -o nested4_loop_nb(n)
print(f'{n:4} {n**4:16} {t1.best * 1e6:15.3f} µs {t2.best * 1e6:15.3f} µs')
n n ** 4 flattened nested4
10 10000 0.195 µs 0.199 µs
20 160000 0.196 µs 0.201 µs
30 810000 0.196 µs 0.200 µs
40 2560000 0.195 µs 0.197 µs
50 6250000 0.193 µs 0.199 µs
60 12960000 0.195 µs 0.199 µs
70 24010000 0.197 µs 0.200 µs
80 40960000 0.195 µs 0.199 µs
90 65610000 0.194 µs 0.197 µs
100 100000000 0.195 µs 0.199 µs
110 146410000 0.194 µs 0.199 µs
Slightly slower (but largely independent of n) execution speed for the nested loops, as expected.
2.43s certainly looks a bit too small compared to 18.22s. Surprisingly, though, the nested loop does consistently come out slightly faster on my machine as well; the size of the gap may be machine dependent.
Another possible reason is that in the flattened loop, very large integers have to be bound to the iteration variable, whereas in the nested loops every value stays below 100 (and CPython pre-allocates and reuses small integers in the range [-5, 256], so the nested loops never need to allocate new int objects). The two loops as given run in 4.2s and 3.9s on my machine; adding one more zero (10^9 iterations), they take 43.3s and 39.2s respectively.
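One way to see this small-integer cache at work (this is CPython-specific behavior; the fresh() helper below is only there to defeat compile-time constant folding):

def fresh(n):
    # computed at run time, so the result is not a shared compile-time constant
    return n + 0

# CPython pre-allocates ints in [-5, 256] and hands out the same objects
print(fresh(100) is fresh(100))    # True: both are the same cached object
print(fresh(1000) is fresh(1000))  # False: each 1000 is freshly allocated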