UPDATE
Sorry guys, but the previous number is INACCURATE. When I tested the previous code, I had wrapped the iterable in tqdm to see the expected time, and that wrapper hurts performance badly when the iterable is very long. That is how I got 18.22s, which is about 7.5 times longer than 2.43s.
HOWEVER, the conclusion is CONSISTENT: the nested loop is much FASTER. When the iteration count is 100^5, the difference is significant: 321.49s vs 210.05s, roughly a 1.53× gap. Generally we don't face iteration counts this long; I'm just curious about the reason for this anomalous situation.
My Python version is 3.7.3, on a 13-inch MacBook Pro (2019) with a 2.4 GHz Intel Core i5, running macOS Mojave 10.14.6.
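For reference, the per-iteration overhead added by wrapping an iterable in tqdm can be checked directly; this is just a minimal sketch (the 10**7 size is arbitrary, and absolute times depend on the machine):

import time
from tqdm import tqdm

n = 10**7

start = time.time()
for i in range(n):
    pass
print('bare range:', time.time() - start)

# the same loop, with tqdm doing bookkeeping on every iteration
start = time.time()
for i in tqdm(range(n)):
    pass
print('with tqdm: ', time.time() - start)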
I found a weird situation in Python:
import time

start = time.time()
# flattened loop
for i in range(100**4):
    pass
print(time.time() - start)  # 18.22 (wrong, see update: should be 3.09)

# nested loop
start = time.time()
for i in range(100):
    for j in range(100):
        for k in range(100):
            for l in range(100):
                pass
print(time.time() - start)  # 2.43
The two loops above have the same total number of iterations. However, the nested loop (2.43s) runs much faster than the flattened one (18.22s), and the difference grows as the iteration count increases. How does this happen?
Firstly, that is not a reliable way of measuring code execution. Let us consider this code instead (to be run in IPython), which does not include the power computation inside the loop and performs some work in the body just to make sure the loop cannot be optimized away into "please do nothing":
def flattened_loop(n):
    x = 0
    for i in range(n):
        x += 1
    return x


def nested4_loop(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
t1 = %timeit -q -o flattened_loop(n)
t2 = %timeit -q -o nested4_loop(n)
print(f'{n:4} {n**4:16} {t1.best * 1e3:15.3f} ms {t2.best * 1e3:15.3f} ms')
n n ** 4 flattened nested4
10 10000 0.526 ms 0.653 ms
20 160000 8.561 ms 8.459 ms
30 810000 43.077 ms 39.417 ms
40 2560000 136.709 ms 121.422 ms
50 6250000 331.748 ms 291.132 ms
60 12960000 698.014 ms 599.228 ms
70 24010000 1280.681 ms 1081.062 ms
80 40960000 2187.500 ms 1826.629 ms
90 65610000 3500.463 ms 2942.909 ms
100 100000000 5349.721 ms 4437.965 ms
110 146410000 7835.733 ms 6474.588 ms
which shows that a difference does exist, and that it is larger for larger n.
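For reference, a rough equivalent of this benchmark that does not rely on IPython's %timeit magic can be written with the standard timeit module (a minimal sketch; the measurement strategy differs slightly, so the absolute numbers will not match the table above exactly):

import timeit

# assumes flattened_loop() and nested4_loop() from above are already defined
for n in range(10, 120, 10):
    m = n ** 4
    t1 = min(timeit.repeat(lambda: flattened_loop(m), number=1, repeat=3))
    t2 = min(timeit.repeat(lambda: nested4_loop(n), number=1, repeat=3))
    print(f'{n:4} {m:16} {t1 * 1e3:15.3f} ms {t2 * 1e3:15.3f} ms')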
Is the first one running more bytecode? No, as we can clearly see through dis:
flattened_loop()
import dis
dis.dis(flattened_loop)
2 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (x)
3 4 SETUP_LOOP 24 (to 30)
6 LOAD_GLOBAL 0 (range)
8 LOAD_FAST 0 (n)
10 CALL_FUNCTION 1
12 GET_ITER
>> 14 FOR_ITER 12 (to 28)
16 STORE_FAST 2 (i)
4 18 LOAD_FAST 1 (x)
20 LOAD_CONST 2 (1)
22 INPLACE_ADD
24 STORE_FAST 1 (x)
26 JUMP_ABSOLUTE 14
>> 28 POP_BLOCK
5 >> 30 LOAD_FAST 1 (x)
32 RETURN_VALUE
nested4_loop()
dis.dis(nested4_loop)
9 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (x)
10 4 SETUP_LOOP 78 (to 84)
6 LOAD_GLOBAL 0 (range)
8 LOAD_FAST 0 (n)
10 CALL_FUNCTION 1
12 GET_ITER
>> 14 FOR_ITER 66 (to 82)
16 STORE_FAST 2 (i)
11 18 SETUP_LOOP 60 (to 80)
20 LOAD_GLOBAL 0 (range)
22 LOAD_FAST 0 (n)
24 CALL_FUNCTION 1
26 GET_ITER
>> 28 FOR_ITER 48 (to 78)
30 STORE_FAST 3 (j)
12 32 SETUP_LOOP 42 (to 76)
34 LOAD_GLOBAL 0 (range)
36 LOAD_FAST 0 (n)
38 CALL_FUNCTION 1
40 GET_ITER
>> 42 FOR_ITER 30 (to 74)
44 STORE_FAST 4 (k)
13 46 SETUP_LOOP 24 (to 72)
48 LOAD_GLOBAL 0 (range)
50 LOAD_FAST 0 (n)
52 CALL_FUNCTION 1
54 GET_ITER
>> 56 FOR_ITER 12 (to 70)
58 STORE_FAST 5 (l)
14 60 LOAD_FAST 1 (x)
62 LOAD_CONST 2 (1)
64 INPLACE_ADD
66 STORE_FAST 1 (x)
68 JUMP_ABSOLUTE 56
>> 70 POP_BLOCK
>> 72 JUMP_ABSOLUTE 42
>> 74 POP_BLOCK
>> 76 JUMP_ABSOLUTE 28
>> 78 POP_BLOCK
>> 80 JUMP_ABSOLUTE 14
>> 82 POP_BLOCK
15 >> 84 LOAD_FAST 1 (x)
86 RETURN_VALUE
Note that per iteration of the innermost loop, both versions execute the same seven instructions (FOR_ITER, STORE_FAST, LOAD_FAST, LOAD_CONST, INPLACE_ADD, STORE_FAST, JUMP_ABSOLUTE); the nested version merely adds outer-loop overhead once every n iterations.
Are the numbers in the single loops getting too big? No.
import sys
print([(n, sys.getsizeof(n), n ** 4, sys.getsizeof(n ** 4)) for n in (10, 110)])
# [(10, 28, 10000, 28), (110, 28, 146410000, 28)]
Is the power operation (not timed in my code, but timed in yours) explaining the timing difference? No (it is timed only once below, because constant computations get cached in Python):
%timeit -r1 -n1 100 ** 4
# loop, best of 1: 708 ns per loop
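The caching referred to here is CPython's compile-time constant folding: the peephole optimizer evaluates 100 ** 4 once and stores the result as a literal constant in the bytecode, which dis makes visible:

import dis

# 100 ** 4 is folded into the constant 100000000 at compile time,
# so the power is never recomputed while the loop runs
dis.dis(compile('100 ** 4', '<expr>', 'eval'))
#   1           0 LOAD_CONST               0 (100000000)
#               2 RETURN_VALUE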
So, what is happening?
At this point this is just speculation but, given that we have ruled out at least some of the potential candidates, I believe that this is due to some caching mechanism taking place in the nested version. Such caching is probably in place to speed up the notoriously slow explicit Python-level looping.
If we repeat the same test with Numba compilation (where loops get lifted, i.e. executed without the boilerplate Python requires to ensure its dynamism), we actually get:
import numba as nb


@nb.jit
def flattened_loop_nb(n):
    x = 0
    for i in range(n):
        x += 1
    return x


@nb.jit
def nested4_loop_nb(n):
    x = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    x += 1
    return x
flattened_loop_nb(100) # trigger compilation
nested4_loop_nb(100) # trigger compilation
print(f'{"n":>4s} {"n ** 4":>16s} {"flattened":>18s} {"nested4":>18s}')
for n in range(10, 120, 10):
m = n ** 4
t1 = %timeit -q -o flattened_loop_nb(m)
t2 = %timeit -q -o nested4_loop_nb(n)
print(f'{n:4} {n**4:16} {t1.best * 1e6:15.3f} µs {t2.best * 1e6:15.3f} µs')
n n ** 4 flattened nested4
10 10000 0.195 µs 0.199 µs
20 160000 0.196 µs 0.201 µs
30 810000 0.196 µs 0.200 µs
40 2560000 0.195 µs 0.197 µs
50 6250000 0.193 µs 0.199 µs
60 12960000 0.195 µs 0.199 µs
70 24010000 0.197 µs 0.200 µs
80 40960000 0.195 µs 0.199 µs
90 65610000 0.194 µs 0.197 µs
100 100000000 0.195 µs 0.199 µs
110 146410000 0.194 µs 0.199 µs
Slightly slower (but largely independent of n) execution speed for the nested loops, as expected.
2.43s certainly looks a bit too small compared to 18.22s. Surprisingly, though, the nested loop does consistently come out slightly faster on my machine as well; the size of the gap may be machine dependent.
Another possible reason is that in the flattened loop, very large integers have to be bound to the iteration variable, whereas in the nested loops every value stays below 100 (and CPython pre-allocates and reuses small integers in the range [-5, 256], so the nested loops never need to allocate new int objects). The two loops as given run in 4.2s and 3.9s on my machine; adding one more zero (10^9 iterations), they take 43.3s and 39.2s respectively.
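One way to see this small-integer cache at work (this is CPython-specific behavior; the fresh() helper below is only there to defeat compile-time constant folding):

def fresh(n):
    # computed at run time, so the result is not a shared compile-time constant
    return n + 0

# CPython pre-allocates ints in [-5, 256] and hands out the same objects
print(fresh(100) is fresh(100))    # True: both are the same cached object
print(fresh(1000) is fresh(1000))  # False: each 1000 is freshly allocated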