UPDATE (reflecting my current state of knowledge), status: 2017-05-12
The reason for this update is that at the time I asked this question I was not aware of something I have since discovered about how Python3 works "under the hood".
The conclusion from all that follows is:
If you write your own Python3 code for an iterator and care about speed of execution, you should write it as a generator function and not as an iterator class.
Below is a minimal code example demonstrating that the same algorithm (here: a self-made version of Python's range()) expressed as a generator function runs much faster than when expressed as an iterator class:
def gnrtYieldRange(startWith, endAt, step=1):
while startWith <= endAt:
yield startWith
startWith += step
class iterClassRange:
def __init__(self, startWith, endAt, step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
N = 10000000
print(" Size of created list N = {} elements (ints 1 to N)".format(N))
from time import time as t
from customRange import gnrtYieldRange as cthnYieldRange
from customRange import cintYieldRange
from customRange import iterClassRange as cthnClassRange
from customRange import cdefClassRange
iterPythnRangeObj = range(1, N+1)
gnrtYieldRangeObj = gnrtYieldRange(1, N)
cthnYieldRangeObj = cthnYieldRange(1, N)
cintYieldRangeObj = cintYieldRange(1, N)
iterClassRangeObj = iterClassRange(1, N)
cthnClassRangeObj = cthnClassRange(1, N)
cdefClassRangeObj = cdefClassRange(1, N)
sEXECs = [
"liPR = list(iterPythnRangeObj)",
"lgYR = list(gnrtYieldRangeObj)",
"lcYR = list(cthnYieldRangeObj)",
"liGR = list(cintYieldRangeObj)",
"liCR = list(iterClassRangeObj)",
"lcCR = list(cthnClassRangeObj)",
"ldCR = list(cdefClassRangeObj)"
]
sCOMMENTs = [
"Python3 own range(1, N+1) used here as reference for timings ",
"self-made range generator function using yield (run as it is) ",
"self-made range (with yield) run from module created by Cython",
"Cython-optimized self-made range (using yield) run from module",
"self-made range as iterator class using __next__() and return ",
"self-made range (using __next__) from module created by Cython",
"Cython-optimized self-made range (using __next__) from module "
]
for idx, sEXEC in enumerate(sEXECs):
s=t();exec(sEXEC);e=t();print("{} takes: {:3.1f} sec.".format(sCOMMENTs[idx], e-s))
print("All created lists are equal:", all([liPR == lgYR, lgYR == lcYR, lcYR == liGR, liGR == liCR, liCR == lcCR, lcCR == ldCR]) )
print("Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'")
The code above, put into a file and run, prints the following to stdout:
>python3.6 -u "gnrtFunction-fasterThan-iterClass_runMe.py"
Size of created list N = 10000000 elements (ints 1 to N)
Python3 own range(1, N+1) used here as reference for timings takes: 0.2 sec.
self-made range generator function using yield (run as it is) takes: 1.1 sec.
self-made range (with yield) run from module created by Cython takes: 0.5 sec.
Cython-optimized self-made range (using yield) run from module takes: 0.3 sec.
self-made range as iterator class using __next__() and return takes: 3.9 sec.
self-made range (using __next__) from module created by Cython takes: 3.3 sec.
Cython-optimized self-made range (using __next__) from module takes: 0.2 sec.
All created lists are equal: True
Run on Linux Mint 18.1, used Cython.__version__ == '0.25.2'
>Exit code: 0
From the timings above you can see that the generator function variant of the self-made range() iterator runs faster than the iterator class variant, and that when no code optimization is involved this behavior also propagates into the C code created by Cython.
If you are curious why in detail it is that way, you can read through the provided answer(s) or play a bit with the provided code yourself.
Below are the missing pieces of code necessary to run the code above:
customRange.pyx - the file Cython creates the customRange module from:
def gnrtYieldRange(startWith, endAt, step=1):
while startWith <= endAt:
yield startWith
startWith += step
class iterClassRange:
def __init__(self, startWith, endAt, step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
def cintYieldRange(int startWith, int endAt, int step=1):
while startWith <= endAt:
yield startWith
startWith += step
cdef class cdefClassRange:
cdef int startWith
cdef int endAt
cdef int step
def __init__(self, int startWith, int endAt, int step=1):
self.startWith = startWith - 1
self.endAt = endAt
self.step = step
def __iter__(self):
return self
def __next__(self):
self.startWith += self.step
if self.startWith <= self.endAt:
return self.startWith
else:
raise StopIteration
and the setup file customRange-setup.py used to create the Python customRange module:
import sys
sys.argv += ['build_ext', '--inplace']
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = 'customRange',
ext_modules = cythonize("customRange.pyx"),
)
At the time I asked this question I was busy with a quite complex algorithm for generating unique combinations from a non-unique list, available in the form of a generator function using yield. My goal was to create a Python module written in C using this algorithm, to make it run faster. For this purpose I rewrote the generator function, which used yield, into an iterator class using __next__() and return. As I compared the speed of both variants of the algorithm I was surprised that the iterator class was two times slower than the generator function, and I had (wrongly) assumed it had something to do with the way I rewrote the algorithm (you need to know this if you want to better understand what the answers here are about). I had therefore originally asked how to make the iterator class version run at the same speed as the generator function, and where the speed difference comes from.
Below is some more about the HISTORY of the question:
In the Python script code provided below, exactly the same algorithm for creating unique combinations from a non-unique list of elements was implemented both as a Python function using yield and as a class using __next__. The code is ready to run after copy/paste, so you can see for yourself what I am speaking about.
The same phenomenon observed for pure Python code propagates into the C code of a Python extension module created from the script code by Cython, so it is not limited to Python-level code; it does not vanish at the C-code level.
The question is:
Where does the huge difference in speed of execution come from? Is there anything that can be done to get both code variants to run at comparable speed? Did something go wrong with the class/__next__ implementation compared to the function/yield variant? To my knowledge both are exactly the same code ...
Here is the code (tweaking the value of uniqLevel changes the level of uniqueness of elements in the list the combinations are generated from, which has a huge impact on the running time):
def uniqCmboYieldIter(lstItems, lenCmbo):
dctCounter = {}
lenLstItems = len(lstItems)
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
lenUniqs = len(lstUniqs)
cmboAsIdxUniqs = [None] * lenCmbo
multiplicities = [0] * lenUniqs
idxIntoCmbo, idxIntoUniqs = 0, 0
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
if idxIntoCmbo != lenCmbo:
return
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
# print("# multiplicities:", multiplicities)
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
class uniqCmboClassIter:
# ----------------------------------------------------------------------------------------------
def __iter__(self):
return self
# ----------------------------------------------------------------------------------------------
def __init__(self, lstItems, lenCmbo):
dctCounter = {}
lenLstItems = len(lstItems)
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
self.lstUniqs = sorted(dctCounter.keys())
self.lenUniqs = len(self.lstUniqs)
self.lstCntRpts = [dctCounter[item] for item in self.lstUniqs]
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = [0] * self.lenUniqs
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = None
self.y = None
return
# ----------------------------------------------------------------------------------------------
def __next__(self):
if self.stopIteration is True:
raise StopIteration
return
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
# ============================================================================================================================================
lstSize = 48 # 48
uniqLevel = 12 # (7 ~60% unique) higher level => more unique items in the generated list
aList = []
from random import randint
for _ in range(lstSize):
aList.append( ( randint(1,uniqLevel), randint(1,uniqLevel) ) )
lenCmbo = 6
percUnique = 100.0 - 100.0*(lstSize-len(set(aList)))/lstSize
print("======================== lenCmbo:", lenCmbo,
" sizeOfList:", len(aList),
" noOfUniqueInList", len(set(aList)),
" percUnique", int(percUnique) )
import time
from itertools import combinations
# itertools.combinations
# ---
# def uniqCmboYieldIter(lstItems, lenCmbo):
# class uniqCmboClassIter: def __init__(self, lstItems, lenCmbo):
# ---
start_time = time.time()
print("Combos:%9i"%len(list(combinations(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list( combinations(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboYieldIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboYieldIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
start_time = time.time()
print("Combos:%9i"%len(list(uniqCmboClassIter(aList, lenCmbo))), " ", end='')
duration = time.time() - start_time
print("print(len(list(uniqCmboClassIter(aList, lenCmbo)))):", "{:9.5f}".format(duration), "seconds.")
And here are the timings on my box:
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 32 percUnique 66
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.04635 seconds.
Combos: 1296058 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 3.25447 seconds.
Combos: 1296058 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 5.97371 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 22 percUnique 45
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.05199 seconds.
Combos: 191072 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 0.47343 seconds.
Combos: 191072 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 0.89860 seconds.
>Exit code: 0
>python3.6 -u "nonRecursiveUniqueCombos_Cg.py"
======================== lenCmbo: 6 sizeOfList: 48 noOfUniqueInList 43 percUnique 89
Combos: 12271512 print(len(list( combinations(aList, lenCmbo)))): 2.17285 seconds.
Combos: 6560701 print(len(list(uniqCmboYieldIter(aList, lenCmbo)))): 16.72573 seconds.
Combos: 6560701 print(len(list(uniqCmboClassIter(aList, lenCmbo)))): 31.17714 seconds.
>Exit code: 0
UPDATE (status 2017-05-07):
At the time of asking the question and offering a bounty it was not known to me that there is a way to easily create the C code of an extension module for an iterator object out of Python script code using Cython, and that such C code can also be created from an iterator function using yield.
Considering that even the faster generated version of the C extension module is still not fast enough to compete with itertools.combinations, it doesn't make much sense to dive deeply into what exactly is causing the slowdown when using an iterator class compared to an iterator function and how to overcome it. It makes much more sense to find a way to speed up the faster version using Cython, especially because I am a total novice in writing Python extension modules and failed to create working code after hours and hours of intense, focused work spent on tweaking the existing C code of itertools.combinations with my own modifications, because of Segmentation Fault errors whose reason I was not able to grasp.
Currently I think that there is still room to speed up the Cython code I used, and no need to go the harder way of writing the C code myself.
Below is Cython code that runs OK, and speed-optimized Cython code which somehow (I can't currently see the reason for that) changes the way the algorithm works and therefore produces wrong results. The idea behind the Cython optimization was to use Python/Cython arrays in the Cython code instead of Python lists. Any hints on how a novice can "safely" get a faster-running Python extension module out of the used algorithm are welcome.
def subbags_by_loops_with_dict_counter(lstItems, int lenCmbo):
dctCounter = {}
cdef int lenLstItems = len(lstItems)
cdef int idx = 0
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
cdef int lenUniqs = len(lstUniqs)
cmboAsIdxUniqs = [None] * lenCmbo
multiplicities = [0] * lenUniqs
cdef int idxIntoCmbo
cdef int idxIntoUniqs
cdef int count
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
cmboAsIdxUniqs[idxIntoCmbo : idxIntoCmbo + count] = [idxIntoUniqs] * count
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
if idxIntoCmbo != lenCmbo:
return
cdef int x
cdef int y
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
Below OPTIMIZED CYTHON CODE which produces wrong results:
# imports needed for the array.array / array.resize calls used below
from cpython cimport array
import array

def subbags_loops_dict_cython_optimized(lstItems, int lenCmbo):
dctCounter = {}
cdef int lenLstItems = len(lstItems)
cdef int idx = 0
for idx in range(lenLstItems):
item = lstItems[idx]
if item in dctCounter.keys():
dctCounter[item] += 1
else:
dctCounter[item] = 1
#:if
#:for
lstUniqs = sorted(dctCounter.keys())
lstCntRpts = [dctCounter[item] for item in lstUniqs]
cdef int lenUniqs = len(lstUniqs)
cdef array.array cmboAsIdxUniqs = array.array('i', [])
array.resize(cmboAsIdxUniqs, lenCmbo)
# cmboAsIdxUniqs = [None] * lenCmbo
cdef array.array multiplicities = array.array('i', [])
array.resize(multiplicities, lenUniqs)
# multiplicities = [0] * lenUniqs
cdef int idxIntoCmbo
cdef int maxIdxCmbo
cdef int curIdxCmbo
cdef int idxIntoUniqs
cdef int count
while idxIntoCmbo != lenCmbo and idxIntoUniqs != lenUniqs:
count = min(lstCntRpts[idxIntoUniqs], lenCmbo-idxIntoCmbo)
maxIdxCmbo = idxIntoCmbo + count
curIdxCmbo = idxIntoCmbo
while curIdxCmbo < maxIdxCmbo:
cmboAsIdxUniqs[curIdxCmbo] = idxIntoUniqs
curIdxCmbo += 1
multiplicities[idxIntoUniqs] = count
idxIntoCmbo += count
idxIntoUniqs += 1
# print("multiplicities:", multiplicities)
# print("cmboAsIdxUniqs:", cmboAsIdxUniqs)
if idxIntoCmbo != lenCmbo:
return
cdef int x
cdef int y
while True:
yield tuple(lstUniqs[idxUniqs] for idxUniqs in cmboAsIdxUniqs)
for idxIntoCmbo in reversed(range(lenCmbo)):
x = cmboAsIdxUniqs[idxIntoCmbo]
y = x + 1
if y < lenUniqs and multiplicities[y] < lstCntRpts[y]:
break
else:
return
for idxIntoCmbo in range(idxIntoCmbo, lenCmbo):
x = cmboAsIdxUniqs[idxIntoCmbo]
cmboAsIdxUniqs[idxIntoCmbo] = y
multiplicities[x] -= 1
multiplicities[y] += 1
# print("# multiplicities:", multiplicities)
while y != lenUniqs and multiplicities[y] == lstCntRpts[y]:
y += 1
if y == lenUniqs:
break
I gained some experience when I rewrote some of the recipes of the itertools documentation as C extensions, and I think I have some insights that could help you.
When you write pure Python code, it's a tradeoff between speed (generator) and features (iterator).
The yield functions (known as generators) are for speed, and generally they can be written without bothering about internal state. So they are less effort to write, and they are fast because Python just manages all the "state".
The reason generators are faster (or at least not slower) is mostly because:
They implement the __next__ slot directly (typically tp_iternext) besides the __next__ method. In that case Python doesn't have to look up the __next__ method - that's essentially what makes it faster in the following example:
from itertools import islice
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 173 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 499 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So it's almost 3 times faster just because generators directly populate the __next__ slot.
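A quick way to see this difference from Python itself (my own illustration, not part of the original answer): in CPython the generator type's __next__ is a C-level slot wrapper, while the class's __next__ is an ordinary Python function:
def test():
    while True:
        yield 1

class Test(object):
    def __iter__(self):
        return self
    def __next__(self):
        return 1

# The generator exposes __next__ via a C slot, the class via Python code:
print(type(type(test()).__next__))   # <class 'wrapper_descriptor'>
print(type(Test.__next__))           # <class 'function'>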
A yield function and the class both have state, but the yield function saves and loads the state much faster than you could with a class and attribute access:
def test():
i = 0
while True:
yield i
i += 1
class Test(object):
def __init__(self):
self.val = 0
def __iter__(self):
return self
def __next__(self):
current = self.val
self.val += 1
return current
%timeit list(islice(test(), 1000))
# 296 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test(), 1000))
# 1.22 ms ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time the class is already 4 times slower (compared to the almost 3 times, when no state was involved). That is a cumulative effect: so the more "state" you have, the slower the class variant will be.
So much for the yield vs. class approach. Note that the actual timing will depend on the kind of operations: for example, if the actual code that is run when next is called is slow (e.g. time.sleep(1)), then there's almost no difference between generator and class!
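To illustrate that last point, here is a minimal sketch of mine (the sleep duration is an arbitrary assumption): when the per-item work dominates, both variants take essentially the same time:
import time
from itertools import islice

def slow_gen():
    while True:
        time.sleep(0.001)   # simulated expensive per-item work
        yield 1

class SlowIter(object):
    def __iter__(self):
        return self
    def __next__(self):
        time.sleep(0.001)   # same expensive per-item work
        return 1

for factory in (slow_gen, SlowIter):
    start = time.perf_counter()
    list(islice(factory(), 100))
    # both take roughly 0.1 s; the iteration overhead is negligible here
    print(factory.__name__, time.perf_counter() - start)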
If you want a Cython iterator class that is fast, it has to be a cdef class. Otherwise you don't get the really fast class. The reason is that only a cdef class creates an extension type that directly implements the tp_iternext field! I'll use IPython's %%cython to compile the code (so I don't have to include the setup):
%%cython
def test():
while True:
yield 1
class Test(object):
def __iter__(self):
return self
def __next__(self):
return 1
cdef class Test_cdef(object):
def __iter__(self):
return self
def __next__(self):
return 1
%timeit list(islice(test(), 1000))
# 113 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(islice(Test(), 1000))
# 407 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(islice(Test_cdef(), 1000))
# 62.8 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timings already show that the generator and basic class are faster than the pure Python equivalents, but their relative performance roughly stayed the same. However, the cdef class variant beats both of them, and that's mainly because the tp_iternext slot was used instead of just implementing the __next__ method. (Inspect the Cython-generated C code if you don't trust me :) )
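If you want to look at the generated C yourself, Cython can write an annotated HTML view of it; a small sketch of mine (using the annotate option of cythonize, or equivalently the -a flag of the %%cython magic):
from Cython.Build import cythonize
# Writes customRange.c plus customRange.html, where yellow lines mark
# interaction with the Python C API and white lines are plain C.
cythonize("customRange.pyx", annotate=True)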
However, it's just 2 times faster than the Python generator; that's not bad, but it's not exactly overwhelming. To get really amazing speedups, you'll need to find a way to express your program without Python objects (the fewer Python objects, the more speedup). For example, if you use a dictionary for storing each item and its multiplicity, you still store Python objects and any lookup has to be done using Python dictionary methods - even if you can call them via C API functions instead of having to look up the real methods:
%%cython
cpdef cython_count(items):
cdef dict res = dict()
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
import random
def count(items):
res = {}
for item in items:
if item in res:
res[item] += 1
else:
res[item] = 1
return res
l = [random.randint(0, 100) for _ in range(10000)]
%timeit cython_count(l)
# 2.06 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count(l)
# 3.63 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There's one catch here: you didn't use collections.Counter, which has optimized C code (at least in Python 3) for this kind of operation:
from collections import Counter
%timeit Counter(l)
# 1.17 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A quick note here: don't use something in some_dict.keys(), because keys() is list-like in Python 2 and only implements O(n) contains operations, while something in some_dict is typically O(1) (in both Pythons)! That will make things faster in both versions, but especially on Python 2:
def count2(items):
res = {}
for item in items:
if item in res.keys(): # with "keys()"
res[item] += 1
else:
res[item] = 1
return res
# Python3
l = [random.randint(0, 100) for _ in range(10000)]
%timeit count(l)
# 3.63 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit count2(l)
# 5.9 ms ± 20 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Python2
l = [random.randint(0, 10000) for _ in range(10000)]
%timeit count(l)
# 100 loops, best of 3: 4.59 ms per loop
%timeit count2(l)
# 1 loop, best of 3: 2.65 s per loop <--- WHOOPS!!!
That shows that you can only hope for something like a 3-4 times speedup with Cython (and C extensions) when you use Python structures, and that even minor mistakes like using ".keys()" incorrectly can cost you much more in terms of performance.
So what can you do if you want it faster? The answer is relatively easy: create your own data structure based on C types instead of Python types.
That means you have to think about the design:
- Which item types do you want to support in your uniqComb**? Do you want only integers (the examples say so, but I suppose you want arbitrary Python objects)?
- Do you want the output of your uniqComb** function to be sortable? You used sorted, but you could also use an OrderedDict and keep the keys in order of appearance instead of by numerical value.

The answers to these questions (these are only the questions I immediately asked myself, there are probably many more!) can help you decide which structure you can use internally. For example, with Cython you can interface with C++, and you could use a map containing integer keys and integer values instead of a dictionary. It is sorted by default, so you don't need to sort it manually, and you operate on native integers instead of Python objects. But you lose the ability to process arbitrary Python objects in your uniqComb, and you need to know how to operate with C++ types in Cython. It could be amazingly fast though!
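A minimal sketch of that idea from me (an assumption, not code from the question; it needs a C++-enabled Cython build): counting integer multiplicities in a std::map, which keeps its keys sorted:
%%cython --cplus
from libcpp.map cimport map as cppmap

def count_ints(items):
    cdef cppmap[int, int] counts
    cdef int item
    for item in items:
        counts[item] = counts[item] + 1   # operator[] default-inserts 0
    # Cython coerces map[int, int] back to a Python dict on return
    return counts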
I don't go down that path because I assume you want to support arbitrary orderable Python types, so I stick with the Counter as a starting point, but I'll store the multiplicities as integer array.arrays instead of as lists. Let's call it the "least invasive" optimization. It actually doesn't matter much in terms of performance whether you use a list or an array for lstCntRpts and multiplicities, because they aren't a bottleneck - but it's a bit faster, saves a bit of memory and, more importantly, shows how you can include homogeneous arrays with Cython:
%%cython
from cpython.list cimport PyList_Size # (most) C API functions can be used with cython!
from array import array
from collections import Counter
cdef class uniqCmboClassIter:
cdef list lstUniqs
cdef Py_ssize_t lenUniqs
cdef int[:] lstCntRpts # memoryview
cdef Py_ssize_t lenCmbo
cdef list cmboAsIdxUniqs
cdef int[:] multiplicities # memoryview
cdef Py_ssize_t idxIntoCmbo
cdef Py_ssize_t idxIntoUniqs
cdef bint stopIteration
cdef Py_ssize_t x
cdef Py_ssize_t y
def __init__(self, lstItems, lenCmbo):
dctCounter = Counter(lstItems)
self.lstUniqs = sorted(dctCounter)
self.lenUniqs = PyList_Size(self.lstUniqs)
self.lstCntRpts = array('i', [dctCounter[item] for item in self.lstUniqs])
self.lenCmbo = lenCmbo
self.cmboAsIdxUniqs = [None] * lenCmbo
self.multiplicities = array('i', [0] * self.lenUniqs)
self.idxIntoCmbo, self.idxIntoUniqs = 0, 0
while self.idxIntoCmbo != self.lenCmbo and self.idxIntoUniqs != self.lenUniqs:
count = min(self.lstCntRpts[self.idxIntoUniqs], self.lenCmbo-self.idxIntoCmbo)
self.cmboAsIdxUniqs[self.idxIntoCmbo : self.idxIntoCmbo + count] = [self.idxIntoUniqs] * count
self.multiplicities[self.idxIntoUniqs] = count
self.idxIntoCmbo += count
self.idxIntoUniqs += 1
# print("self.multiplicities:", self.multiplicities)
# print("self.cmboAsIdxUniqs:", self.cmboAsIdxUniqs)
if self.idxIntoCmbo != self.lenCmbo:
return
self.stopIteration = False
self.x = 0
self.y = 0
return
def __iter__(self):
return self
def __next__(self):
if self.stopIteration is True:
raise StopIteration
nextCmbo = tuple(self.lstUniqs[idxUniqs] for idxUniqs in self.cmboAsIdxUniqs)
for self.idxIntoCmbo in reversed(range(self.lenCmbo)):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.y = self.x + 1
if self.y < self.lenUniqs and self.multiplicities[self.y] < self.lstCntRpts[self.y]:
break
else:
self.stopIteration = True
return nextCmbo
for self.idxIntoCmbo in range(self.idxIntoCmbo, self.lenCmbo):
self.x = self.cmboAsIdxUniqs[self.idxIntoCmbo]
self.cmboAsIdxUniqs[self.idxIntoCmbo] = self.y
self.multiplicities[self.x] -= 1
self.multiplicities[self.y] += 1
# print("# multiplicities:", multiplicities)
while self.y != self.lenUniqs and self.multiplicities[self.y] == self.lstCntRpts[self.y]:
self.y += 1
if self.y == self.lenUniqs:
break
return nextCmbo
You actually didn't share your parameters for the timings but I tried it with some of mine:
from itertools import combinations
import random
import time
def create_values(maximum):
vals = [random.randint(0, maximum) for _ in range(48)]
print('length: ', len(vals))
print('sorted values: ', sorted(vals))
print('uniques: ', len(set(vals)))
print('uniques in percent: {:%}'.format(len(set(vals)) / len(vals)))
return vals
class Timer(object):
def __init__(self):
pass
def __enter__(self):
self._time = time.time()
def __exit__(self, *args, **kwargs):
print(time.time() - self._time)
vals = create_values(maximum=50) # and 22 and 75 and 120
n = 6
with Timer():
list(combinations(vals, n))
with Timer():
list(uniqCmboClassIter(vals, n))
with Timer():
list(uniqCmboClassIterOriginal(vals, n))
with Timer():
list(uniqCmboYieldIterOriginal(vals, n))
length: 48
sorted values: [0, 0, 0, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 17, 18, 19, 19, 19, 19, 20, 20, 20, 21, 21, 22, 22]
uniques: 21
uniques in percent: 43.750000%
6.250450611114502
0.4217393398284912
4.250436305999756
2.7186365127563477

length: 48
sorted values: [1, 1, 2, 5, 6, 7, 7, 8, 8, 9, 11, 13, 13, 15, 16, 16, 16, 16, 17, 19, 19, 21, 21, 23, 24, 26, 27, 28, 28, 29, 31, 31, 34, 34, 36, 36, 38, 39, 39, 40, 41, 42, 44, 46, 47, 47, 49, 50]
uniques: 33
uniques in percent: 68.750000%
6.2034173011779785
4.343803882598877
42.39261245727539
26.65750527381897

length: 48
sorted values: [4, 4, 7, 9, 10, 14, 14, 17, 19, 21, 23, 24, 24, 26, 34, 36, 40, 42, 43, 43, 45, 46, 46, 52, 53, 58, 59, 59, 61, 63, 66, 68, 71, 72, 72, 75, 76, 80, 82, 82, 83, 84, 86, 86, 89, 92, 97, 99]
uniques: 39
uniques in percent: 81.250000%
6.859697341918945
10.437987327575684
104.12988543510437
65.25306582450867

length: 48
sorted values: [4, 7, 11, 19, 24, 29, 32, 36, 49, 49, 54, 57, 58, 60, 62, 65, 67, 70, 70, 72, 72, 79, 82, 83, 86, 89, 89, 90, 91, 94, 96, 99, 102, 111, 112, 118, 120, 120, 128, 129, 129, 134, 138, 141, 141, 144, 146, 147]
uniques: 41
uniques in percent: 85.416667%
6.484673023223877
13.610010623931885
136.28764533996582
84.73834943771362
It definitely performs much better than the original approaches, actually several times faster with just type declarations. There's probably a lot more that could be optimized (disabling bounds checking, using Python C API function calls, using unsigned or smaller integers if you know the "maximum" and "minimum" of your multiplicities, ...) - but the fact that it's not much slower than itertools.combinations even for 80% unique items, and much faster than any original implementation, is good enough for me. :-)
When you write a generator function using yield, the overhead of saving off and restoring state is handled by the CPython internals (implemented in C). With __iter__/__next__, you have to manage saving and restoring state on each call. In CPython, Python-level code is slower than C-level built-ins, so the extra Python-level code involved in the state management (including things as simple as accessing attributes of self via dict lookups rather than loading local variables, which have only array-indexing overhead) ends up costing you a lot.
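A tiny illustration of mine (not from the original answer) that a generator's state lives in its suspended frame, which CPython saves and restores in C:
def counter():
    i = 0
    while True:
        yield i
        i += 1

g = counter()
next(g)                          # yields 0
next(g)                          # resumes, i += 1, yields 1
print(g.gi_frame.f_locals["i"])  # 1 - the local kept alive between calls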
If you implement your own iterator-protocol-supporting type in a C extension module, you'll bypass this overhead; saving and restoring state should be a matter of a few C-level variable accesses (with similar or lesser overhead compared to what Python generator functions incur, which is to say, very little). Effectively, that's what generator functions are: a C extension type that saves and restores the Python frame on each call to tp_iternext (the C-level equivalent of __next__).