In spite of the fact that there are online plenty of algorithms and functions for generating unique combinations of any size from a list of unique items, there is none available in case of a list of non-unique items (i.e. list containing repetitions of same value.)
The question is how to generate ON-THE-FLY in a generator function all the unique combinations from a non-unique list without the computational expensive need of filtering out duplicates?
I consider combination comboA
to be unique if there is no other combination comboB
for which sorted lists for both combinations are the same. Let's give an example of code checking for such uniqueness:
comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")
In the above given example B is a duplicate of A and the print() prints B is a duplicate of A.
The problem of getting a generator function capable of providing unique combinations on-the-fly in case of a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the provided generator function needs lookups and requires memory what causes problems in case of a huge amount of combinations.
The in the current version of the answer provided function does the job without any lookups and appears to be the right answer here, BUT ...
The goal behind getting rid of lookups is to speed up the generation of unique combinations in case of a list with duplicates.
I have initially (writing the first version of this question) wrongly assumed that code which doesn't require creation of a set used for lookups needed to assure uniqueness is expected to give an advantage over code needing lookups. It is not the case. At least not always. The code in up to now provided answer does not using lookups, but is taking much more time to generate all the combinations in case of no redundant list or if only a few redundant items are in the list.
Here some timings to illustrate the current situation:
-----------------
k: 6 len(ls): 48
Combos Used Code Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.036 seconds
12271512 len(list(subbags(ls,k))) : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) : 8.174 seconds
12271512 len(set(combinations(sorted(ls),k))): 7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k))) : 2.030 seconds
1 len(list(subbags(ls,k))) : 0.001 seconds
1 len(list(uniqueCombinations(ls,k))) : 3.619 seconds
1 len(set(combinations(sorted(ls),k))): 2.592 seconds
Above timings illustrate the two extremes: no duplicates and only duplicates. All other timings are between this two.
My interpretation of the results above is that a pure Python function (not using any C-compiled modules) can be extremely faster, but it can be also much slower depending on how many duplicates are in a list. So there is probably no way around writing C/C++ code for a Python .so extension module providing the required functionality.
Instead of post-processing/filtering your output, you can pre-process your input list. This way, you can avoid generating duplicates in the first place. Pre-processing involves either sorting (or using a collections.Counter
on) the input. One possible recursive realization is:
def subbags(bag, k):
a = sorted(bag)
n = len(a)
sub = []
def index_of_next_unique_item(i):
j = i + 1
while j < n and a[j] == a[i]:
j += 1
return j
def combinate(i):
if len(sub) == k:
yield tuple(sub)
elif n - i >= k - len(sub):
sub.append(a[i])
yield from combinate(i + 1)
sub.pop()
yield from combinate(index_of_next_unique_item(i))
yield from combinate(0)
bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1
print(sorted(bag), k)
print('---')
for i, subbag in enumerate(subbags(bag, k)):
print(subbag)
print('---')
print(i + 1)
Output:
[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6
Requires some stack space for the recursion, but this + sorting the input should use substantially less time + memory than generating and discarding repeats.
The current state-of-the-art inspired initially by a 50 than by a 100 reps bounties is at the moment (instead of a Python extension module written entirely in C):
An efficient algorithm and implementation that is better than the obvious
(set + combinations)
approach in the best (and average) case, and is competitive with it in the worst case.
It seems to be possible to fulfill this requirement using a kind of "fake it before you make it" approach. The current state-of-the-art is that there are two generator function algorithms available for solving the problem of getting unique combinations in case of a non-unique list. The below provided algorithm combines both of them what becomes possible because it seems to exist a threshold value for percentage of unique items in the list which can be used for appropriate switching between the two algorithms. The calculation of the percentage of uniqueness is done with so tiny amount of computation time that it even doesn't clearly show up in the final results due to common variation of the taken timing.
def iterFastUniqueCombos(lstList, comboSize, percUniqueThresh=60):
lstListSorted = sorted(lstList)
lenListSorted = len(lstListSorted)
percUnique = 100.0 - 100.0*(lenListSorted-len(set(lstListSorted)))/lenListSorted
lstComboCandidate = []
setUniqueCombos = set()
def idxNextUnique(idxItemOfList):
idxNextUniqueCandidate = idxItemOfList + 1
while (
idxNextUniqueCandidate < lenListSorted
and
lstListSorted[idxNextUniqueCandidate] == lstListSorted[idxItemOfList]
): # while
idxNextUniqueCandidate += 1
idxNextUnique = idxNextUniqueCandidate
return idxNextUnique
def combinate(idxItemOfList):
if len(lstComboCandidate) == sizeOfCombo:
yield tuple(lstComboCandidate)
elif lenListSorted - idxItemOfList >= sizeOfCombo - len(lstComboCandidate):
lstComboCandidate.append(lstListSorted[idxItemOfList])
yield from combinate(idxItemOfList + 1)
lstComboCandidate.pop()
yield from combinate(idxNextUnique(idxItemOfList))
if percUnique > percUniqueThresh:
from itertools import combinations
allCombos = combinations(lstListSorted, comboSize)
for comboCandidate in allCombos:
if comboCandidate in setUniqueCombos:
continue
yield comboCandidate
setUniqueCombos.add(comboCandidate)
else:
yield from combinate(0)
#:if/else
#:def iterFastUniqueCombos()
The below provided timings show that the above iterFastUniqueCombos()
generator function provides a clear advantage
over uniqueCombinations()
variant in case the list has less than 60 percent of unique elements and is not worse as
the on (set + combinations)
based uniqueCombinations()
generator function in the opposite case where it gets much faster than the iterUniqueCombos()
one (due to switching between
the (set + combinations)
and the (no lookups)
variant at 60% threshold for amount of unique elements in the list):
=========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 1 percUnique 2
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.04968 seconds.
Combos: 1 print(len(list( iterUniqueCombos(lst,k)))) : 0.00011 seconds.
Combos: 1 print(len(list( iterFastUniqueCombos(lst,k)))) : 0.00008 seconds.
Combos: 1 print(len(list( uniqueCombinations(lst,k)))) : 3.61812 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 48 percUnique 100
Combos: 12271512 print(len(list(combinations(lst,k)))) : 1.99383 seconds.
Combos: 12271512 print(len(list( iterUniqueCombos(lst,k)))) : 49.72461 seconds.
Combos: 12271512 print(len(list( iterFastUniqueCombos(lst,k)))) : 8.07997 seconds.
Combos: 12271512 print(len(list( uniqueCombinations(lst,k)))) : 8.11974 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 27 percUnique 56
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.02774 seconds.
Combos: 534704 print(len(list( iterUniqueCombos(lst,k)))) : 1.60052 seconds.
Combos: 534704 print(len(list( iterFastUniqueCombos(lst,k)))) : 1.62002 seconds.
Combos: 534704 print(len(list( uniqueCombinations(lst,k)))) : 3.41156 seconds.
========== sizeOfCombo: 6 sizeOfList: 48 noOfUniqueInList 31 percUnique 64
Combos: 12271512 print(len(list(combinations(lst,k)))) : 2.03539 seconds.
Combos: 1114062 print(len(list( iterUniqueCombos(lst,k)))) : 3.49330 seconds.
Combos: 1114062 print(len(list( iterFastUniqueCombos(lst,k)))) : 3.64474 seconds.
Combos: 1114062 print(len(list( uniqueCombinations(lst,k)))) : 3.61857 seconds.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With