FAST unique combinations (from list with duplicates) WITHOUT LOOKUPS

Although plenty of algorithms and functions for generating unique combinations of any size from a list of unique items are available online, none of them handles a list of non-unique items (i.e. a list containing repetitions of the same value).

The question is how to generate ON-THE-FLY, in a generator function, all the unique combinations from a non-unique list without the computationally expensive need of filtering out duplicates.

I consider a combination comboA to be unique if there is no other combination comboB for which the sorted lists of both combinations are equal. Here is an example of code checking for such uniqueness:

comboA = [1,2,2]
comboB = [2,1,2]
print("B is a duplicate of A" if sorted(comboA)==sorted(comboB) else "A is unique compared to B")

In the example given above, B is a duplicate of A, so print() outputs B is a duplicate of A.

The problem of getting a generator function capable of providing unique combinations on-the-fly for a non-unique list is solved here: Getting unique combinations from a non-unique list of items, FASTER?, but the provided generator function needs lookups and therefore requires memory, which causes problems for a huge number of combinations.

The function provided in the current version of the answer does the job without any lookups and appears to be the right answer here, BUT ...

The goal behind getting rid of lookups is to speed up the generation of unique combinations for a list with duplicates.

While writing the first version of this question, I wrongly assumed that code which doesn't need to create a set for the lookups that assure uniqueness would have an advantage over code that does. That is not the case, at least not always. The code in the answer provided so far does not use lookups, but takes much more time to generate all the combinations when the list has no redundant items, or only a few of them.

Here are some timings to illustrate the current situation:

-----------------
 k: 6 len(ls): 48
Combos   Used Code                               Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.036 seconds
12271512 len(list(subbags(ls,k)))            : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) :  8.174 seconds
12271512 len(set(combinations(sorted(ls),k))):  7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.030 seconds
       1 len(list(subbags(ls,k)))            :  0.001 seconds
       1 len(list(uniqueCombinations(ls,k))) :  3.619 seconds
       1 len(set(combinations(sorted(ls),k))):  2.592 seconds

The timings above illustrate the two extremes: no duplicates and only duplicates. All other timings fall between these two.
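The effect of the two extremes can be reproduced at a small scale (list length 12 instead of 48 and k = 3 instead of 6, chosen here only so the check runs instantly):

```python
from itertools import combinations
from math import comb

k = 3
ls_unique = list(range(12))   # no duplicates at all
ls_dupes  = [1] * 12          # nothing but duplicates

n_unique = len(set(combinations(sorted(ls_unique), k)))
n_dupes  = len(set(combinations(sorted(ls_dupes), k)))

print(n_unique)   # comb(12, 3) == 220: every combination is distinct
print(n_dupes)    # 1: everything collapses to (1, 1, 1)
```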

My interpretation of the results above is that a pure Python function (not using any C-compiled modules) can be extremely fast, but it can also be much slower, depending on how many duplicates the list contains. So there is probably no way around writing C/C++ code for a Python .so extension module providing the required functionality.

asked Apr 07 '17 by Claudio

2 Answers

Instead of post-processing/filtering your output, you can pre-process your input list. This way you avoid generating duplicates in the first place. Pre-processing involves either sorting the input or using a collections.Counter on it. One possible recursive realization is:

def subbags(bag, k):
    a = sorted(bag)   # sorting groups equal items together
    n = len(a)
    sub = []          # current partial combination

    def index_of_next_unique_item(i):
        # index of the first item after a[i] with a different value
        j = i + 1

        while j < n and a[j] == a[i]:
            j += 1

        return j

    def combinate(i):
        if len(sub) == k:
            yield tuple(sub)
        elif n - i >= k - len(sub):   # enough items left to complete sub
            # branch 1: take a[i]
            sub.append(a[i])
            yield from combinate(i + 1)
            sub.pop()
            # branch 2: skip a[i] and all its duplicates
            yield from combinate(index_of_next_unique_item(i))

    yield from combinate(0)

bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1

print(sorted(bag), k)
print('---')

for i, subbag in enumerate(subbags(bag, k)):
    print(subbag)

print('---')
print(i + 1)

Output:

[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6

Requires some stack space for the recursion, but this plus sorting the input should use substantially less time and memory than generating and discarding repeats.
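The collections.Counter route mentioned above can be sketched like this (a hypothetical variant, not this answer's code): recurse over the distinct values and decide how many copies of each to take, so duplicates never arise:

```python
from collections import Counter

def subbags_counter(bag, k):
    # (value, available count) pairs, sorted so output is ordered like subbags()
    items = sorted(Counter(bag).items())

    def pick(i, need):
        if need == 0:          # combination complete
            yield ()
            return
        if i == len(items):    # no values left, but 'need' items still missing
            return
        value, avail = items[i]
        # take t copies of this value, trying larger t first
        for t in range(min(avail, need), -1, -1):
            for rest in pick(i + 1, need - t):
                yield (value,) * t + rest

    yield from pick(0, k)

print(list(subbags_counter([1, 2, 3, 1, 2, 1], 3)))
# same six unique combinations as subbags([1, 2, 3, 1, 2, 1], 3)
```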

answered Nov 19 '22 by lazy dog


The current state of the art, prompted initially by a 50, then by a 100 reputation bounty, is at the moment (instead of a Python extension module written entirely in C):

An efficient algorithm and implementation that is better than the obvious (set + combinations) approach in the best (and average) case, and is competitive with it in the worst case.

It seems possible to fulfill this requirement using a kind of "fake it before you make it" approach. The current state of the art is that two generator-function algorithms are available for solving the problem of getting unique combinations from a non-unique list. The algorithm below combines both of them, which becomes possible because there appears to be a threshold for the percentage of unique items in the list that can be used to switch appropriately between the two algorithms. Calculating the percentage of uniqueness takes so little computation time that it doesn't even show up clearly in the final results, given the common variation of the measured timings.

def iterFastUniqueCombos(lstList, comboSize, percUniqueThresh=60):

    lstListSorted = sorted(lstList)
    lenListSorted = len(lstListSorted)

    percUnique = 100.0 - 100.0*(lenListSorted-len(set(lstListSorted)))/lenListSorted

    lstComboCandidate = []
    setUniqueCombos = set()

    def idxNextUnique(idxItemOfList):
        # index of the first item after idxItemOfList with a different value
        idxNextUniqueCandidate = idxItemOfList + 1
        while (
                idxNextUniqueCandidate < lenListSorted 
                    and 
                lstListSorted[idxNextUniqueCandidate] == lstListSorted[idxItemOfList]
        ): # while
            idxNextUniqueCandidate += 1
        return idxNextUniqueCandidate

    def combinate(idxItemOfList):
        if len(lstComboCandidate) == comboSize:
            yield tuple(lstComboCandidate)
        elif lenListSorted - idxItemOfList >= comboSize - len(lstComboCandidate):
            lstComboCandidate.append(lstListSorted[idxItemOfList])
            yield from combinate(idxItemOfList + 1)
            lstComboCandidate.pop()
            yield from combinate(idxNextUnique(idxItemOfList))

    if percUnique > percUniqueThresh:
        # mostly unique items: the (set + combinations) filtering is faster
        from itertools import combinations
        allCombos = combinations(lstListSorted, comboSize)
        for comboCandidate in allCombos:
            if comboCandidate in setUniqueCombos:
                continue
            yield comboCandidate
            setUniqueCombos.add(comboCandidate)
    else:
        # many duplicates: the recursive no-lookup variant is faster
        yield from combinate(0)
    #:if/else    
#:def iterFastUniqueCombos()

The timings below show that the iterFastUniqueCombos() generator function above provides a clear advantage over the uniqueCombinations() variant when the list has less than 60 percent unique elements, and is no worse than the (set + combinations) based uniqueCombinations() generator function in the opposite case, where it gets much faster than the iterUniqueCombos() one (thanks to switching between the (set + combinations) variant and the (no lookups) variant at the 60% threshold of unique elements in the list):

===========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 1   percUnique   2
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.04968 seconds.
Combos:        1  print(len(list(      iterUniqueCombos(lst,k)))) :   0.00011 seconds.
Combos:        1  print(len(list(  iterFastUniqueCombos(lst,k)))) :   0.00008 seconds.
Combos:        1  print(len(list(    uniqueCombinations(lst,k)))) :   3.61812 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 48   percUnique 100
Combos: 12271512  print(len(list(combinations(lst,k))))           :   1.99383 seconds.
Combos: 12271512  print(len(list(      iterUniqueCombos(lst,k)))) :  49.72461 seconds.
Combos: 12271512  print(len(list(  iterFastUniqueCombos(lst,k)))) :   8.07997 seconds.
Combos: 12271512  print(len(list(    uniqueCombinations(lst,k)))) :   8.11974 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 27   percUnique  56
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.02774 seconds.
Combos:   534704  print(len(list(      iterUniqueCombos(lst,k)))) :   1.60052 seconds.
Combos:   534704  print(len(list(  iterFastUniqueCombos(lst,k)))) :   1.62002 seconds.
Combos:   534704  print(len(list(    uniqueCombinations(lst,k)))) :   3.41156 seconds.

==========  sizeOfCombo: 6   sizeOfList: 48   noOfUniqueInList 31   percUnique  64
Combos: 12271512  print(len(list(combinations(lst,k))))           :   2.03539 seconds.
Combos:  1114062  print(len(list(      iterUniqueCombos(lst,k)))) :   3.49330 seconds.
Combos:  1114062  print(len(list(  iterFastUniqueCombos(lst,k)))) :   3.64474 seconds.
Combos:  1114062  print(len(list(    uniqueCombinations(lst,k)))) :   3.61857 seconds.
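For reference, timings of the above kind can be reproduced with a minimal harness along these lines (the original benchmarking script isn't shown, so the list construction and labels here are assumptions; sizes are reduced from the 48-item runs so this finishes instantly):

```python
from itertools import combinations
from random import randrange, seed
from time import perf_counter

def time_call(label, fn):
    # exhaust the iterable returned by fn() and report count + wall time
    t0 = perf_counter()
    n = len(list(fn()))
    print("Combos: %8d  %-40s: %9.5f seconds." % (n, label, perf_counter() - t0))

seed(0)
k, size, n_vals = 6, 18, 5                      # much smaller than the 48-item runs
lst = [randrange(n_vals) for _ in range(size)]  # list guaranteed to have duplicates

time_call("len(list(combinations(lst,k)))", lambda: combinations(lst, k))
time_call("len(set(combinations(sorted(lst),k)))",
          lambda: set(combinations(sorted(lst), k)))
```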
answered Nov 19 '22 by Claudio