Why is there a performance difference between the order of a nested loop?

Question

I have a process that loops through two lists, one being relatively large while the other being significantly smaller.

Example:

larger_list = list(range(15000))
smaller_list = list(range(2500))

for ll in larger_list:
    for sl in smaller_list:            
        pass

I scaled the sized down of the lists to test performance, and I noticed there is a decent difference between which list is looped through first.

import timeit

larger_list = list(range(150))
smaller_list = list(range(25))


def large_then_small():
    for ll in larger_list:
        for sl in smaller_list:
            pass


def small_then_large():
    for sl in smaller_list:
        for ll in larger_list:
            pass


print('Larger -> Smaller: {}'.format(timeit.timeit(large_then_small)))
print('Smaller -> Larger: {}'.format(timeit.timeit(small_then_large)))

>>> Larger -> Smaller: 114.884992572
>>> Smaller -> Larger: 98.7751009799

At first glance, they look identical - however there is 16 second difference between the two functions.

Why is that?

Ilja Everilä · Accepted Answer

This same phenomenon was under discussion in this duplicate and got me interested in what goes on in the C land of CPython. Built python with:

% ./configure --enable-profiling
% make coverage

Tests

% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for sl in smaller_list:
    for ll in larger_list: 
        pass"
% mv gmon.out soflgmon.out

% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for ll in larger_list:
    for sl in smaller_list:
        pass"
% mv gmon.out lofsgmon.out

Results

Short list of long lists (total time for a single run 1.60):

% gprof python soflgmon.out|head -n40
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 46.25      0.74     0.74     3346     0.00     0.00  PyEval_EvalFrameEx
 25.62      1.15     0.41 37518735     0.00     0.00  insertdict
 14.38      1.38     0.23 37555121     0.00     0.00  lookdict_unicode_nodummy
  7.81      1.50     0.12 37506675     0.00     0.00  listiter_next
  4.06      1.57     0.07 37516233     0.00     0.00  PyDict_SetItem
  0.62      1.58     0.01     2095     0.00     0.00  _PyEval_EvalCodeWithName
  0.62      1.59     0.01        3     0.00     0.00  untrack_dicts
  0.31      1.59     0.01                             _PyDict_SetItem_KnownHash
  0.31      1.60     0.01                             listiter_len
  0.00      1.60     0.00    87268     0.00     0.00  visit_decref
  0.00      1.60     0.00    73592     0.00     0.00  visit_reachable
  0.00      1.60     0.00    71261     0.00     0.00  _PyThreadState_UncheckedGet
  0.00      1.60     0.00    49742     0.00     0.00  _PyObject_Alloc
  0.00      1.60     0.00    48922     0.00     0.00  PyObject_Malloc
  0.00      1.60     0.00    48922     0.00     0.00  _PyObject_Malloc
  0.00      1.60     0.00    47487     0.00     0.00  PyDict_GetItem
  0.00      1.60     0.00    44246     0.00     0.00  _PyObject_Free
  0.00      1.60     0.00    43637     0.00     0.00  PyObject_Free
  0.00      1.60     0.00    30034     0.00     0.00  slotptr
  0.00      1.60     0.00    24892     0.00     0.00  type_is_gc
  0.00      1.60     0.00    24170     0.00     0.00  r_byte
  0.00      1.60     0.00    23774     0.00     0.00  PyErr_Occurred
  0.00      1.60     0.00    20371     0.00     0.00  _PyType_Lookup
  0.00      1.60     0.00    19930     0.00     0.00  PyLong_FromLong
  0.00      1.60     0.00    19758     0.00     0.00  r_string
  0.00      1.60     0.00    19080     0.00     0.00  _PyLong_New
  0.00      1.60     0.00    18887     0.00     0.00  lookdict_unicode
  0.00      1.60     0.00    18878     0.00     0.00  long_dealloc
  0.00      1.60     0.00    17639     0.00     0.00  PyUnicode_InternInPlace
  0.00      1.60     0.00    17502     0.00     0.00  rangeiter_next
  0.00      1.60     0.00    14776     0.00     0.00  PyObject_GC_UnTrack
  0.00      1.60     0.00    14578     0.00     0.00  descr_traverse
  0.00      1.60     0.00    13520     0.00     0.00  r_long
  0.00      1.60     0.00    13058     0.00     0.00  PyUnicode_New
  0.00      1.60     0.00    12298     0.00     0.00  _Py_CheckFunctionResult
  ...

Long list of short lists (total time for a single run 1.64):

gprof python lofsgmon.out|head -n40
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 48.78      0.80     0.80     3346     0.00     0.00  PyEval_EvalFrameEx
 17.99      1.09     0.29 37531168     0.00     0.00  insertdict
 11.59      1.28     0.19 37531675     0.00     0.00  listiter_next
 11.28      1.47     0.18 37580156     0.00     0.00  lookdict_unicode_nodummy
  6.71      1.58     0.11 37528666     0.00     0.00  PyDict_SetItem
  1.22      1.60     0.02                             _PyDict_SetItem_KnownHash
  0.61      1.61     0.01     5525     0.00     0.00  update_one_slot
  0.61      1.62     0.01      120     0.00     0.00  PyDict_Merge
  0.30      1.62     0.01    18178     0.00     0.00  lookdict_unicode
  0.30      1.63     0.01    11988     0.00     0.00  insertdict_clean
  0.30      1.64     0.01                             listiter_len
  0.30      1.64     0.01                             listiter_traverse
  0.00      1.64     0.00    96089     0.00     0.00  _PyThreadState_UncheckedGet
  0.00      1.64     0.00    87245     0.00     0.00  visit_decref
  0.00      1.64     0.00    74743     0.00     0.00  visit_reachable
  0.00      1.64     0.00    62232     0.00     0.00  _PyObject_Alloc
  0.00      1.64     0.00    61412     0.00     0.00  PyObject_Malloc
  0.00      1.64     0.00    61412     0.00     0.00  _PyObject_Malloc
  0.00      1.64     0.00    59815     0.00     0.00  PyDict_GetItem
  0.00      1.64     0.00    55231     0.00     0.00  _PyObject_Free
  0.00      1.64     0.00    54622     0.00     0.00  PyObject_Free
  0.00      1.64     0.00    36274     0.00     0.00  PyErr_Occurred
  0.00      1.64     0.00    30034     0.00     0.00  slotptr
  0.00      1.64     0.00    24929     0.00     0.00  type_is_gc
  0.00      1.64     0.00    24617     0.00     0.00  _PyObject_GC_Alloc
  0.00      1.64     0.00    24617     0.00     0.00  _PyObject_GC_Malloc
  0.00      1.64     0.00    24170     0.00     0.00  r_byte
  0.00      1.64     0.00    20958     0.00     0.00  PyObject_GC_Del
  0.00      1.64     0.00    20371     0.00     0.00  _PyType_Lookup
  0.00      1.64     0.00    19918     0.00     0.00  PyLong_FromLong
  0.00      1.64     0.00    19758     0.00     0.00  r_string
  0.00      1.64     0.00    19068     0.00     0.00  _PyLong_New
  0.00      1.64     0.00    18845     0.00     0.00  long_dealloc
  0.00      1.64     0.00    18507     0.00     0.00  _PyObject_GC_New
  0.00      1.64     0.00    17639     0.00     0.00  PyUnicode_InternInPlace
  ...

The difference is marginal (2.4%), and profiling adds to run time, so it is difficult to say how much it actually would've been. The total time also includes the creation of the test lists, so that hides the true difference further.

The reason for the 16s difference in the original test is that timeit.timeit runs the given statement or function number=1000000 times by default, so that would add up to a whopping 40,000s in this case. Don't quote that value though, as it is an artifact of profiling. With your original test code and non profiling python3 on this machine I get:

Larger -> Smaller: 40.29234626500056
Smaller -> Larger: 33.09413992699956

which would mean a difference of

In [1]: (40.29234626500056-33.09413992699956)/1000000
Out[1]: 7.198206338001e-06

per single run (7.2µs), 18% in total.

So as stated in the former answer, POP_BLOCK gets executed more, but it's not just that, but the whole inner loop setup:

  0.00      1.64     0.00    16521     0.00     0.00  PyFrame_BlockSetup
  0.00      1.64     0.00    16154     0.00     0.00  PyFrame_BlockPop

Compared to the short list of long lists:

  0.00      1.60     0.00     4021     0.00     0.00  PyFrame_BlockSetup
  0.00      1.60     0.00     3748     0.00     0.00  set_next
  0.00      1.60     0.00     3654     0.00     0.00  PyFrame_BlockPop

That has negligible impact though.

Why is there a performance difference between the order of a nested loop?

Tags:

python

loops

python-2.x

Wondercricket

1 Answers

Tests

Results

Ilja Everilä

Recent Activity

Donate For Us

Why is there a performance difference between the order of a nested loop?

Tags:

python

loops

python-2.x

Wondercricket

1 Answers

Tests

Results

Ilja Everilä

Related questions

Recent Activity

Donate For Us