I was trying to make a pure-Python (no external dependencies) element-wise comparison of two sequences. My first solution was:
list(map(operator.eq, seq1, seq2))
Then I found the starmap function from itertools, which seemed pretty similar to me. But it turned out to be 37% faster on my computer in the worst case. As it was not obvious to me, I measured the time necessary to retrieve one element from each generator (I don't know if this way of measuring is correct):
from operator import eq
from itertools import starmap
seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5
gen1 = map(eq, seq1, seq2)
gen2 = starmap(eq, zip(seq1, seq2))
%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)
271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
When retrieving elements, the second solution is 24% more performant. Both produce the same results for list(), but somewhere we gain an extra 13% in time:
%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))
5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how to dig deeper into profiling such nested code, so my question is: why is the second generator faster at retrieving elements, and where does the extra 13% in the list() call come from?
EDIT:
My first intention was to perform an element-wise comparison rather than reduce with all, so the all function was replaced with list. This replacement does not affect the timing ratio.
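For context, here is a minimal sketch of the two variants (the sample sequences are made up for illustration):

from operator import eq
from itertools import starmap

a = [1, 2, 3]
b = [1, 2, 4]

# The original intent: reduce to a single truth value (short-circuits on the
# first mismatch).
print(all(map(eq, a, b)))              # False

# The list() replacement used for the timings: keep every element-wise result.
print(list(starmap(eq, zip(a, b))))    # [True, True, False]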
CPython 3.6.2 on Windows 10 (64bit)
There are several factors that contribute (in conjunction) to the observed performance difference:
- zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
- map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually it probably won't create a new tuple from scratch, because Python maintains a free list of unused tuples, but then map has to find an unused tuple of the right size.
- starmap checks if the next item in the iterable is of type tuple and, if so, just passes it on.
- PyObject_Call (a C function calling another C function) won't create a new tuple to pass to the callee.

So starmap with zip will use one and the same tuple over and over again, passing it to operator.eq and thus reducing the function call overhead immensely. map, on the other hand, will create a new tuple (or fill a C array, from CPython 3.6 on) every time operator.eq is called. So the actual speed difference is just the tuple creation overhead.
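As a quick pure-Python sanity check of the tuple re-use (this relies on CPython implementation details, so treat it as illustrative only):

z = zip([1, 2, 3], [4, 5, 6])
# No reference to the yielded tuple survives between the two next() calls, so
# its refcount drops back to 1 and zip recycles the same tuple object.
print(id(next(z)) == id(next(z)))    # True on CPython

z = zip([1, 2, 3], [4, 5, 6])
first = next(z)                      # keeping a reference -> refcount stays > 1
second = next(z)
print(id(first) == id(second))       # False: zip had to build a fresh tuple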
Instead of linking to the source code I'll provide some Cython code that can be used to verify this:
In [1]: %load_ext cython
In [2]: %%cython
...:
...: from cpython.ref cimport Py_DECREF
...:
...: cpdef func(zipper):
...: a = next(zipper)
...: print('a', a)
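   ...:     # deliberately drop our reference so only zip's internal one remains (refcount 1)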
...: Py_DECREF(a)
...: b = next(zipper)
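   ...:     # zip saw refcount 1 and recycled the tuple in place; 'a' now shows the new values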
...: print('a', a)
In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)
Yes, tuples aren't really immutable here: a simple Py_DECREF was sufficient to "trick" zip into believing no one else holds a reference to the returned tuple!
As for the "tuple pass-through":
In [4]: %%cython
...:
...: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
In [5]: func(1, 2)
1404350461320
1404350461320
So the tuple is passed right through (just because these are defined as C functions!). This doesn't happen for pure-Python functions:
In [6]: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
...:
In [7]: func(1, 2)
1404350436488
1404352833800
Note that it also doesn't happen if the called function isn't a C function even if called from a C function:
In [8]: %%cython
...:
...: def func_inner_c(*args):
...: print(id(args))
...:
...: def func(inner, *args):
...: print(id(args))
...: inner(*args)
...:
In [9]: def func_inner_py(*args):
...: print(id(args))
...:
...:
In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184
In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824
So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments when the called function is also a C function...
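If you want to check how much of this depends on the callee being a C function, here is a small timing sketch you can adapt (py_eq is a made-up pure-Python stand-in; absolute numbers will of course vary):

import timeit
from operator import eq
from itertools import starmap

def py_eq(a, b):
    # pure-Python stand-in for operator.eq: each call goes through a Python frame
    return a == b

seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000

for func in (eq, py_eq):
    t_map = timeit.timeit(lambda: list(map(func, seq1, seq2)), number=100)
    t_star = timeit.timeit(lambda: list(starmap(func, zip(seq1, seq2))), number=100)
    print(f"{func.__name__}: map {t_map:.3f}s, starmap+zip {t_star:.3f}s")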
One difference I can notice is how map retrieves items from the iterables. Both map and zip create a tuple of iterators from the iterables passed in. Now zip maintains a result tuple internally that is populated every time next is called, while map, on the other hand, creates a new array* with each next call and deallocates it.
*As pointed out by MSeifert, up to 3.5.4 map_next used to allocate a new Python tuple every time. This changed in 3.6: for up to 5 iterables the C stack is used, and for anything larger the heap is used. Related PRs: "Issue #27809: map_next() uses fast call" and "Add _PY_FASTCALL_SMALL_STACK constant" | Issue: https://bugs.python.org/issue27809
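A rough pure-Python model of the two code paths (an illustrative simplification; the real implementations live in C and include the fastcall and tuple re-use tricks discussed above):

from operator import eq

def model_map(func, *iterables):
    # map: pull one item from each iterator and build a fresh argument pack
    # for every single call.
    iterators = [iter(it) for it in iterables]
    while True:
        args = []
        for it in iterators:
            try:
                args.append(next(it))
            except StopIteration:
                return  # stop at the shortest iterable, like the real map
        yield func(*args)

def model_starmap(func, iterable):
    # starmap: the items arrive already packed; each one is forwarded as-is.
    for args in iterable:
        yield func(*args)

print(list(model_map(eq, [1, 2, 3], [1, 2, 4])))           # [True, True, False]
print(list(model_starmap(eq, zip([1, 2, 3], [1, 2, 4]))))  # [True, True, False]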