I tested list(x for x in a) with three different CPython versions. On a = [0] it's significantly faster than on a = []:
3.9.0 64-bit          3.9.0 32-bit          3.7.8 64-bit
a = []    a = [0]     a = []    a = [0]     a = []    a = [0]
465 ns    412 ns      543 ns    515 ns      513 ns    457 ns
450 ns    406 ns      544 ns    515 ns      506 ns    491 ns
456 ns    408 ns      551 ns    513 ns      515 ns    487 ns
455 ns    413 ns      548 ns    516 ns      513 ns    491 ns
452 ns    404 ns      549 ns    511 ns      508 ns    486 ns
With tuple instead of list, it's the other way around, as expected:
3.9.0 64-bit          3.9.0 32-bit          3.7.8 64-bit
a = []    a = [0]     a = []    a = [0]     a = []    a = [0]
354 ns    405 ns      467 ns    514 ns      421 ns    465 ns
364 ns    407 ns      467 ns    527 ns      425 ns    464 ns
353 ns    399 ns      490 ns    549 ns      419 ns    465 ns
352 ns    400 ns      500 ns    556 ns      414 ns    474 ns
354 ns    405 ns      494 ns    560 ns      420 ns    474 ns
So why is list faster when it (and the underlying generator iterator) has to do more?
Tested on Windows 10 Pro 2004 64-bit.
Benchmark code:
from timeit import repeat
setups = 'a = []', 'a = [0]'
number = 10**6
print(*setups, sep=' ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)', setup, number=number)) / number
        print('%d ns' % (t * 1e9), end=' ')
    print()
Byte sizes, showing that it doesn't overallocate for input [] but does for input [0]:
>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40
>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72
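A minimal way to read those numbers, assuming a 64-bit build where each list slot is one 8-byte pointer:

# 40 bytes is the bare list header; the generator-built one-element list
# carries (72 - 40) / 8 = 4 pointer slots, i.e. it was over-allocated.
header = [].__sizeof__()                                   # 40
slots = (list(x for x in [0]).__sizeof__() - header) // 8  # 4
print(slots)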
What you observe is that pymalloc (Python's memory manager) is faster than the memory manager provided by your C runtime.
It is easy to see in the profiler that the main difference between the two versions is that list_resize and _PyObject_Realloc need more time in the a=[] case. But why?
When a new list is created from an iterable, the list tries to get a hint of how many elements are in the iterator:
n = PyObject_LengthHint(iterable, 8);
However, this doesn't work for generators, and thus the hint is the default value 8.
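This can be checked from Python with operator.length_hint, which mirrors PyObject_LengthHint (a minimal check; the second argument is the default, matching the 8 in the C call above):

import operator

# The generator in list(x for x in a) exposes no __length_hint__,
# so the default (8) is returned:
print(operator.length_hint((x for x in [0]), 8))  # 8
# A list iterator, by contrast, reports its remaining length:
print(operator.length_hint(iter([0]), 8))         # 1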
After the iterator is exhausted, the list tries to shrink, because it holds only 0 or 1 elements (and not the original capacity allocated due to the too-large size hint). For 1 element this leads (due to over-allocation) to a capacity of 4 elements. However, there is special handling for the case of 0 elements: the buffer will not be over-allocated:
// ...
if (newsize == 0)
    new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...
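The effect of this branch is visible from Python: shrinking a list to one element keeps an over-allocated buffer, while shrinking it to zero elements releases the buffer entirely (a minimal demonstration; the byte values are from the 64-bit builds above):

lst = [0, 1, 2, 3]
del lst[1:]              # list_resize(lst, 1): over-allocates to 4 slots
print(lst.__sizeof__())  # 72 = 40-byte header + 4 * 8-byte slots
del lst[:]               # list_resize(lst, 0): buffer freed, no over-allocation
print(lst.__sizeof__())  # 40 = header only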
So in the "empty" case, PyMem_Realloc will be asked for 0 bytes. This call is passed via _PyObject_Malloc down to pymalloc_alloc, which in the case of 0 bytes returns NULL:
if (UNLIKELY(nbytes == 0)) {
    return NULL;
}
However, _PyObject_Malloc falls back to the "raw" malloc if pymalloc returns NULL:
static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}
as can be easily seen in the definition of _PyMem_RawMalloc:
static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}
Thus, the case a=[0] will use pymalloc, while a=[] will use the memory manager of the underlying C runtime, which explains the observed difference.
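One way to sanity-check this explanation (my own sketch, not part of the original profiling): the PYTHONMALLOC environment variable, available since CPython 3.6, can force all object allocations through the raw C allocator, in which case the a=[0] advantage should largely disappear:

import os
import subprocess
import sys
from textwrap import dedent

# Run the micro-benchmark once under pymalloc and once with everything
# routed through the C runtime's malloc via PYTHONMALLOC=malloc.
bench = dedent("""
    from timeit import repeat
    for setup in ('a = []', 'a = [0]'):
        t = min(repeat('list(x for x in a)', setup, number=10**6)) / 10**6
        print(setup, '-> %d ns' % (t * 1e9))
""")

for alloc in ('pymalloc', 'malloc'):
    print('PYTHONMALLOC =', alloc)
    subprocess.run([sys.executable, '-c', bench],
                   env={**os.environ, 'PYTHONMALLOC': alloc})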
Now, all of this can be seen as a missed optimization, because for newsize=0 we could just set ob_item to NULL, adjust the other members, and return.
Let's try it out:
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}
With this fix, the empty case becomes slightly faster (by about 10%) than the a=[0] case, as expected.
My claim that pymalloc is faster than the C runtime memory manager for smaller sizes can easily be tested with bytes: if more than 512 bytes need to be allocated, pymalloc falls back to plain malloc:
print(bytes(479).__sizeof__()) # 512
%timeit bytes(479) # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__()) # 513
%timeit bytes(480) # 296 ns ± 24.8 ns
The actual difference is larger than the 50% shown (the jump cannot be explained by the one-byte change in size alone), since at least some part of the time is spent on initialization of the bytes object and so on.
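To see the jump in isolation, one can sweep sizes around the threshold (a rough sketch; the exact size at which the jump occurs depends on the bytes object's header overhead):

from timeit import timeit

# Allocation cost around the 512-byte pymalloc limit; sizes below the
# threshold stay within pymalloc, larger ones fall back to malloc.
for n in (400, 479, 480, 600):
    t = timeit('bytes(n)', globals={'n': n}, number=10**6) / 10**6
    print('bytes(%d): sizeof=%d, %d ns' % (n, bytes(n).__sizeof__(), t * 1e9))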
Here is a more direct comparison with the help of Cython:
%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))

def with_cmalloc(int size):
    cdef int i
    for i in range(1000):
        free(malloc(size))
and now:
%timeit with_pymalloc(1) # 15.8 µs ± 566 ns
%timeit with_cmalloc(1) # 51.9 µs ± 2.17 µs
i.e., pymalloc is about 3 times faster (about 35 ns per allocation). Note: some compilers would optimize free(malloc(size)) out, but MSVC doesn't.
As another example: some time ago I replaced the default allocator with pymalloc for a C++ std::map, which led to a speedup by a factor of 4.
For profiling the following script was used:
a = [0]  # or a = []
for _ in range(10000000):
    list(x for x in a)
together with VisualStudio's built-in performance profiler in Release-mode.
The a=[0] version needed 6.6 seconds (in the profiler), while the a=[] version needed 6.9 seconds (i.e., about 5% slower). After the "fix", a=[] needed only 5.8 seconds.
The share of time spent in list_resize and _PyObject_Realloc:

                    a=[0]    a=[]     a=[], fixed
list_resize         3.5%     10.2%    3%
_PyObject_Realloc   3.2%     9.3%     1%
Obviously, there is variance from run to run, but the differences in running times are significant and can explain the lion's share of the observed time difference.
Note: the difference of 0.3 seconds for 10^7 allocations is about 30 ns per allocation (0.3 s / 10^7 = 30 ns), a number similar to the one we got above for the difference between pymalloc's and the C runtime's allocations.
When verifying the above with a debugger, one must be aware that in debug mode Python uses a debug version of pymalloc, which appends additional data to the requested memory; thus pymalloc will never be asked to allocate 0 bytes in a debug build, but 0 bytes + debug overhead, and there will be no fallback to malloc. Thus, one should either profile in release mode or switch to release pymalloc in a debug build (there is probably an option for it - I just don't know it; the relevant part in the code is here and here).
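For completeness (my addition): whether the interpreter is a debug build can be checked from Python, and since CPython 3.6 the PYTHONMALLOC environment variable selects the allocator and debug hooks at startup, which may be the option alluded to above:

import sys

# Debug builds expose sys.gettotalrefcount; release builds do not:
print('debug build:', hasattr(sys, 'gettotalrefcount'))

# Set before launching Python:
#   PYTHONMALLOC=pymalloc -> pymalloc without debug hooks (even in a debug build)
#   PYTHONMALLOC=malloc   -> the raw C allocator everywhere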