I was trying to make a pure-Python (no external dependencies) element-wise comparison of two sequences. My first solution was:
list(map(operator.eq, seq1, seq2))
Then I found the starmap function from itertools, which seemed pretty similar to me. But it turned out to be 37% faster on my computer in the worst case. As it was not obvious to me, I measured the time necessary to retrieve one element from each generator (I don't know if this way of measuring is correct):
from operator import eq
from itertools import starmap
seq1 = [1,2,3]*10000
seq2 = [1,2,3]*10000
seq2[-1] = 5
gen1 = map(eq, seq1, seq2)
gen2 = starmap(eq, zip(seq1, seq2))
%timeit -n1000 -r10 next(gen1)
%timeit -n1000 -r10 next(gen2)
271 ns ± 1.26 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
208 ns ± 1.72 ns per loop (mean ± std. dev. of 10 runs, 1000 loops each)
When retrieving elements, the second solution is 24% more performant. Both produce the same results for list(), but somewhere we gain an extra 13% in time:
%timeit list(map(eq, seq1, seq2))
%timeit list(starmap(eq, zip(seq1, seq2)))
5.24 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.34 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I don't know how to dig deeper into profiling such nested code, so my question is: why is the second generator faster at retrieving elements, and where does the extra 13% in the list() call come from?
EDIT:
My first intention was to perform an element-wise comparison rather than reduce with all, so the all function was replaced with list. This replacement does not affect the timing ratio.
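For context, here is a minimal sketch of the two variants (the sample sequences are made up for illustration):

from operator import eq
from itertools import starmap

a = [1, 2, 3]
b = [1, 2, 4]

# The original intent: reduce to a single truth value (short-circuits on the
# first mismatch).
print(all(map(eq, a, b)))              # False

# The list() replacement used for the timings: keep every element-wise result.
print(list(starmap(eq, zip(a, b))))    # [True, True, False]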
CPython 3.6.2 on Windows 10 (64bit)
There are several factors that contribute (in conjunction) to the observed performance difference:
- zip re-uses the returned tuple if it has a reference count of 1 when the next __next__ call is made.
- map builds a new tuple that is passed to the "mapped function" every time a __next__ call is made. Actually it probably won't create a new tuple from scratch, because Python maintains a free list of unused tuples, but then map has to find an unused tuple of the right size.
- starmap checks if the next item in the iterable is of type tuple and, if so, just passes it on.
- PyObject_Call (a C function calling another C function) won't create a new tuple to pass to the callee.

So starmap with zip will use one and the same tuple over and over again, passing it to operator.eq and thus reducing the function call overhead immensely. map, on the other hand, will create a new tuple (or fill a C array, from CPython 3.6 on) every time operator.eq is called. So the actual speed difference is just the tuple creation overhead.
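As a quick pure-Python sanity check of the tuple re-use (this relies on CPython implementation details, so treat it as illustrative only):

z = zip([1, 2, 3], [4, 5, 6])
# No reference to the yielded tuple survives between the two next() calls, so
# its refcount drops back to 1 and zip recycles the same tuple object.
print(id(next(z)) == id(next(z)))    # True on CPython

z = zip([1, 2, 3], [4, 5, 6])
first = next(z)                      # keeping a reference -> refcount stays > 1
second = next(z)
print(id(first) == id(second))       # False: zip had to build a fresh tuple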
Instead of linking to the source code I'll provide some Cython code that can be used to verify this:
In [1]: %load_ext cython
In [2]: %%cython
...:
...: from cpython.ref cimport Py_DECREF
...:
...: cpdef func(zipper):
...: a = next(zipper)
...: print('a', a)
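   ...:     # deliberately drop our reference so only zip's internal one remains (refcount 1)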
...: Py_DECREF(a)
...: b = next(zipper)
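   ...:     # zip saw refcount 1 and recycled the tuple in place; 'a' now shows the new values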
...: print('a', a)
In [3]: func(zip([1, 2], [1, 2]))
a (1, 1)
a (2, 2)
Yes, tuples aren't really immutable here: a simple Py_DECREF was sufficient to "trick" zip into believing no one else holds a reference to the returned tuple!
As for the "tuple pass-through":
In [4]: %%cython
...:
...: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
In [5]: func(1, 2)
1404350461320
1404350461320
So the tuple is passed right through (just because these are defined as C functions!). This doesn't happen for pure-Python functions:
In [6]: def func_inner(*args):
...: print(id(args))
...:
...: def func(*args):
...: print(id(args))
...: func_inner(*args)
...:
In [7]: func(1, 2)
1404350436488
1404352833800
Note that it also doesn't happen if the called function isn't a C function even if called from a C function:
In [8]: %%cython
...:
...: def func_inner_c(*args):
...: print(id(args))
...:
...: def func(inner, *args):
...: print(id(args))
...: inner(*args)
...:
In [9]: def func_inner_py(*args):
...: print(id(args))
...:
...:
In [10]: func(func_inner_py, 1, 2)
1404350471944
1404353010184
In [11]: func(func_inner_c, 1, 2)
1404344354824
1404344354824
So there are a lot of "coincidences" leading up to the point that starmap with zip is faster than calling map with multiple arguments when the called function is also a C function...
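If you want to check how much of this depends on the callee being a C function, here is a small timing sketch you can adapt (py_eq is a made-up pure-Python stand-in; absolute numbers will of course vary):

import timeit
from operator import eq
from itertools import starmap

def py_eq(a, b):
    # pure-Python stand-in for operator.eq: each call goes through a Python frame
    return a == b

seq1 = [1, 2, 3] * 10000
seq2 = [1, 2, 3] * 10000

for func in (eq, py_eq):
    t_map = timeit.timeit(lambda: list(map(func, seq1, seq2)), number=100)
    t_star = timeit.timeit(lambda: list(starmap(func, zip(seq1, seq2))), number=100)
    print(f"{func.__name__}: map {t_map:.3f}s, starmap+zip {t_star:.3f}s")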
One difference I can notice is how map retrieves items from the iterables. Both map and zip create a tuple of iterators from the iterables passed in. Now zip maintains a result tuple internally that is populated every time next is called, while map, on the other hand, creates a new array* with each next call and deallocates it.
*As pointed out by MSeifert, up to 3.5.4 map_next used to allocate a new Python tuple every time. This changed in 3.6: for up to 5 iterables the C stack is used, and for anything larger the heap is used. Related PRs: "Issue #27809: map_next() uses fast call" and "Add _PY_FASTCALL_SMALL_STACK constant" | Issue: https://bugs.python.org/issue27809
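A rough pure-Python model of the two code paths (an illustrative simplification; the real implementations live in C and include the fastcall and tuple re-use tricks discussed above):

from operator import eq

def model_map(func, *iterables):
    # map: pull one item from each iterator and build a fresh argument pack
    # for every single call.
    iterators = [iter(it) for it in iterables]
    while True:
        args = []
        for it in iterators:
            try:
                args.append(next(it))
            except StopIteration:
                return  # stop at the shortest iterable, like the real map
        yield func(*args)

def model_starmap(func, iterable):
    # starmap: the items arrive already packed; each one is forwarded as-is.
    for args in iterable:
        yield func(*args)

print(list(model_map(eq, [1, 2, 3], [1, 2, 4])))           # [True, True, False]
print(list(model_starmap(eq, zip([1, 2, 3], [1, 2, 4]))))  # [True, True, False]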