Why is list(x for x in a) faster for a=[0] than for a=[]?

Tags:

I tested list(x for x in a) with three different CPython versions. On a = [0] it's significantly faster than on a = []:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

465 ns  412 ns     543 ns  515 ns     513 ns  457 ns   
450 ns  406 ns     544 ns  515 ns     506 ns  491 ns   
456 ns  408 ns     551 ns  513 ns     515 ns  487 ns   
455 ns  413 ns     548 ns  516 ns     513 ns  491 ns   
452 ns  404 ns     549 ns  511 ns     508 ns  486 ns

With tuple instead of list, it's the expected other way around:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

354 ns  405 ns     467 ns  514 ns     421 ns  465 ns   
364 ns  407 ns     467 ns  527 ns     425 ns  464 ns   
353 ns  399 ns     490 ns  549 ns     419 ns  465 ns   
352 ns  400 ns     500 ns  556 ns     414 ns  474 ns   
354 ns  405 ns     494 ns  560 ns     420 ns  474 ns

So why is list faster when it (and the underlying generator iterator) has to do more?

Tested on Windows 10 Pro 2004 64-bit.

Benchmark code:

from timeit import repeat

setups = 'a = []', 'a = [0]'
number = 10**6

print(*setups, sep='   ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)', setup, number=number)) / number
        print('%d ns' % (t * 1e9), end='   ')
    print()

Byte sizes, showing that it doesn't overallocate for input [] but does for input [0]:

>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40

>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72

224

asked Oct 19 '20 13:10

Kelly Bundy

1 Answers

What you observe, is that pymalloc (Python memory manager) is faster than the memory manager provided by your C-runtime.

It is easy to see in the profiler, that the main difference between both versions is that list_resize and _PyObjectRealloc need more time for the a=[]-case. But why?

When a new list is created from an iterable, the list tries to get a hint how many elements are in the iterator:

n = PyObject_LengthHint(iterable, 8);

However, this doesn't work for generators and thus the hint is the default value 8.

After the iterator is exhausted, the list tries to shrink, because there are only 0 or 1 element (and not the original capacity allocated due to a too large size-hint). For 1 element this would lead to (due to over-allocation) capacity of 4 elements. However, there is a special handling for the case of 0 elements: it will not be over-allocated:

// ...
if (newsize == 0)
        new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...

So in the "empty" case, PyMem_Realloc will be asked for 0 bytes. This call will be passed via _PyObject_Malloc down to pymalloc_alloc, which in case of 0 bytes returns NULL:

if (UNLIKELY(nbytes == 0)) {
   return NULL;
}

However, _PyObject_Malloc falls back to the "raw" malloc, if pymalloc returns NULL:

static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}

as can be easily seen in the definition of _PyMem_RawMalloc:

static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}

Thus, the case a=[0] will use pymalloc, while a=[] will use the memory manager of the underlying c-runtime, which explains the observed difference.

Now, this all can be seen as missed optimization, because for newsize=0, we could just set the ob_item to NULL, adjust other members and return.

Let's try it out:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}

with this fix, the empty-case becomes slightly faster (about 10%) than the a=[0] case, as expected.

My claim, that pymalloc is faster for smaller sizes than the C-runtime memory manager, can be easily tested with bytes: if more than 512 bytes need to be allocated, pymalloc will fallback to simple malloc:

print(bytes(479).__sizeof__())   #  512
%timeit bytes(479)               # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__())   #  513
%timeit bytes(480)               # 296 ns ± 24.8 ns

the actual difference is more than the shown 50% (this jump cannot be explained by change of the size by one byte alone), as at least some part of the time is used for initialization of byte object and so on.

Here is a more direct comparison with help of cython:

%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))
        
def with_cmalloc(int size):
    cdef int i
    for i in range(1000):
        free(malloc(size))

and now

%timeit with_pymalloc(1)   #  15.8 µs ± 566 ns
%timeit with_cmalloc(1)    #  51.9 µs ± 2.17 µs

i.e. pymalloc is about 3 times faster (or about 35ns per allocation). Note: some compilers would optimize free(malloc(size)) out, but MSVC doesn't.

As another example: some time ago I have replaced the default allocator through pymalloc for a c++'s std::map which led to a speed up of factor 4.

For profiling the following script was used:

a=[0] # or a=[]
for _ in range(10000000):
    list(x for x in a)

together with VisualStudio's built-in performance profiler in Release-mode.

a=[0]-version needed 6.6 seconds (in profiler) while a=[] version needed 6.9 seconds (i.e. ca. 5% slower). After the "fix", a=[] needed only 5.8 seconds.

The share of time spent in list_resize and _PyObject_Realloc:

                     a=[0]          a=[]       a=[], fixed        
list_resize           3.5%          10.2%          3%
_PyObject_Realloc     3.2%           9.3%          1%

Obviously, there is variance from run to run, but the differences in running times are significant and can explain the lion's share of observed time difference.

Note: the difference of 0.3 second for 10^7 allocations is about 30ns per allocation - a number similar to the one we get for the difference between pymalloc's and c-runtime's allocations.

When verifying the above with debugger, one must be aware, that in the debug-mode Python uses a debug version of pymalloc, which appends additional data to the required memory, thus pymalloc will never be asked to allocate 0 bytes in debug-version, but 0 bytes + debug-overhead and there will be no fallback to malloc. Thus, one should either debug in release mode of switch to realease-pymalloc in debug-build (there is probably an option for it - I just don't know it, the relevant part in code is here and here).

188

answered Oct 09 '22 06:10

ead

Related questions
                            
                                How to define composite primary key in SQLAlchemy
                            
                                What does the "x for x in" syntax mean? [duplicate]
                            
                                Which is better - PyInstaller or cx_Freeze? [closed]
                            
                                DataFrame constructor not properly called! error
                            
                                ufunc 'add' did not contain loop with signature matching type dtype ('S32') ('S32') ('S32')
                            
                                Python argparse: Is there a way to specify a range in nargs?
                            
                                Mocking a Django Queryset in order to test a function that takes a queryset
                            
                                Python 3.5+: How to dynamically import a module given the full file path (in the presence of implicit sibling imports)?
                            
                                Why do we have to use the __dunder__ methods instead of operators when calling via super?
                            
                                How to provide a reproducible copy of your DataFrame with to_clipboard()
                            
                                Equivalent to matlab's imagesc in matplotlib? [duplicate]
                            
                                How to extract numbers (along with comparison adjectives or ranges)
                            
                                Eventlet or gevent or Stackless + Twisted, Pylons, Django and SQL Alchemy
                            
                                Which form of connection to use with pika
                            
                                Python Classes without using def __init__(self)
                            
                                How to install / update package with pipenv without updating the rest of packages
                            
                                How do I propagate C++ exceptions to Python in a SWIG wrapper library?
                            
                                Python: thinking of a module and its variables as a singleton — Clean approach?
                            
                                Python "expected an indented block"
                            
                                How do I use different Python version in venv from standard library? (Not virtualenv!)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why is list(x for x in a) faster for a=[0] than for a=[]?

Tags:

performance

python

python-internals

cpython

Kelly Bundy

People also ask

1 Answers

ead

Recent Activity

Donate For Us