Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is list(x for x in a) faster for a=[0] than for a=[]?

I tested list(x for x in a) with three different CPython versions. On a = [0] it's significantly faster than on a = []:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

465 ns  412 ns     543 ns  515 ns     513 ns  457 ns   
450 ns  406 ns     544 ns  515 ns     506 ns  491 ns   
456 ns  408 ns     551 ns  513 ns     515 ns  487 ns   
455 ns  413 ns     548 ns  516 ns     513 ns  491 ns   
452 ns  404 ns     549 ns  511 ns     508 ns  486 ns   

With tuple instead of list, it's the expected other way around:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

354 ns  405 ns     467 ns  514 ns     421 ns  465 ns   
364 ns  407 ns     467 ns  527 ns     425 ns  464 ns   
353 ns  399 ns     490 ns  549 ns     419 ns  465 ns   
352 ns  400 ns     500 ns  556 ns     414 ns  474 ns   
354 ns  405 ns     494 ns  560 ns     420 ns  474 ns   

So why is list faster when it (and the underlying generator iterator) has to do more?

Tested on Windows 10 Pro 2004 64-bit.

Benchmark code:

from timeit import repeat

setups = 'a = []', 'a = [0]'
number = 10**6

print(*setups, sep='   ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)', setup, number=number)) / number
        print('%d ns' % (t * 1e9), end='   ')
    print()

Byte sizes, showing that it doesn't overallocate for input [] but does for input [0]:

>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40

>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72
like image 224
Kelly Bundy Avatar asked Oct 19 '20 13:10

Kelly Bundy


People also ask

Why is a set faster than a list?

Whether a set is faster or slower than a list depends on what you are doing with it. Iterating through a list is at least as fast as iterating through the elements of a set (usually faster, but perhaps not noticeably so). Membership tests (x in s) are much faster with sets.

Why is list () faster than list () in Python?

To conclude, [] is faster because it’s a literal syntax, whereas list () is slower because Python treats list () just as any other user defined function, which when executed returns a empty list. Hope I was clear! Thanks for reading! , Software developer and consultant in Gothenburg, Sweden. Will Python ever become faster? Yes.

Is list comprehension in Python always faster than for-loops?

List Comprehension in Python - Is it always faster than For-loops? It is widely believed that in Python the usage of list comprehension would always be faster than for-loops. This paper shows that it is faster, but only for simple functions used in loops.

Why is Python so much faster than C++?

With Python, it's easier to experiment, test, refactor etc, so you might well have a faster Python program than the competitor who tried to beat you with ultra-fast C++, because you had time to drop unused features and try out several different algorithms by the time they had their first version ready.


1 Answers

What you observe, is that pymalloc (Python memory manager) is faster than the memory manager provided by your C-runtime.

It is easy to see in the profiler, that the main difference between both versions is that list_resize and _PyObjectRealloc need more time for the a=[]-case. But why?

When a new list is created from an iterable, the list tries to get a hint how many elements are in the iterator:

n = PyObject_LengthHint(iterable, 8);

However, this doesn't work for generators and thus the hint is the default value 8.

After the iterator is exhausted, the list tries to shrink, because there are only 0 or 1 element (and not the original capacity allocated due to a too large size-hint). For 1 element this would lead to (due to over-allocation) capacity of 4 elements. However, there is a special handling for the case of 0 elements: it will not be over-allocated:

// ...
if (newsize == 0)
        new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...

So in the "empty" case, PyMem_Realloc will be asked for 0 bytes. This call will be passed via _PyObject_Malloc down to pymalloc_alloc, which in case of 0 bytes returns NULL:

if (UNLIKELY(nbytes == 0)) {
   return NULL;
}

However, _PyObject_Malloc falls back to the "raw" malloc, if pymalloc returns NULL:

static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}

as can be easily seen in the definition of _PyMem_RawMalloc:

static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}

Thus, the case a=[0] will use pymalloc, while a=[] will use the memory manager of the underlying c-runtime, which explains the observed difference.


Now, this all can be seen as missed optimization, because for newsize=0, we could just set the ob_item to NULL, adjust other members and return.

Let's try it out:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}

with this fix, the empty-case becomes slightly faster (about 10%) than the a=[0] case, as expected.


My claim, that pymalloc is faster for smaller sizes than the C-runtime memory manager, can be easily tested with bytes: if more than 512 bytes need to be allocated, pymalloc will fallback to simple malloc:

print(bytes(479).__sizeof__())   #  512
%timeit bytes(479)               # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__())   #  513
%timeit bytes(480)               # 296 ns ± 24.8 ns

the actual difference is more than the shown 50% (this jump cannot be explained by change of the size by one byte alone), as at least some part of the time is used for initialization of byte object and so on.

Here is a more direct comparison with help of cython:

%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))
        
def with_cmalloc(int size):
    cdef int i
    for i in range(1000):
        free(malloc(size))  

and now

%timeit with_pymalloc(1)   #  15.8 µs ± 566 ns
%timeit with_cmalloc(1)    #  51.9 µs ± 2.17 µs

i.e. pymalloc is about 3 times faster (or about 35ns per allocation). Note: some compilers would optimize free(malloc(size)) out, but MSVC doesn't.

As another example: some time ago I have replaced the default allocator through pymalloc for a c++'s std::map which led to a speed up of factor 4.


For profiling the following script was used:

a=[0] # or a=[]
for _ in range(10000000):
    list(x for x in a)

together with VisualStudio's built-in performance profiler in Release-mode.

a=[0]-version needed 6.6 seconds (in profiler) while a=[] version needed 6.9 seconds (i.e. ca. 5% slower). After the "fix", a=[] needed only 5.8 seconds.

The share of time spent in list_resize and _PyObject_Realloc:

                     a=[0]          a=[]       a=[], fixed        
list_resize           3.5%          10.2%          3%
_PyObject_Realloc     3.2%           9.3%          1%

Obviously, there is variance from run to run, but the differences in running times are significant and can explain the lion's share of observed time difference.

Note: the difference of 0.3 second for 10^7 allocations is about 30ns per allocation - a number similar to the one we get for the difference between pymalloc's and c-runtime's allocations.


When verifying the above with debugger, one must be aware, that in the debug-mode Python uses a debug version of pymalloc, which appends additional data to the required memory, thus pymalloc will never be asked to allocate 0 bytes in debug-version, but 0 bytes + debug-overhead and there will be no fallback to malloc. Thus, one should either debug in release mode of switch to realease-pymalloc in debug-build (there is probably an option for it - I just don't know it, the relevant part in code is here and here).

like image 188
ead Avatar answered Oct 09 '22 06:10

ead