
Picking unordered combinations from pools with overlap

I have pools of values and I would like to generate every possible unordered combination by picking from certain pools.

For example, I wanted to pick from pool 0, pool 0, and pool 1:

>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4), (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4), (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

However, order doesn't matter to me, so many of the combinations are effectively duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.

I came up with a simple method to mitigate this issue. For members picked from a single pool, I select without ordering using combinations_with_replacement. I count how many times I want to draw from each pool. The code looks like this:

cnt = Counter()
for ind in part: cnt[ind] += 1
blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
return [list(chain(*combo)) for combo in product(*blocks)]
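
For reference, here is that snippet wrapped into a runnable function and applied to the pools/part from the top of the question (the function name is mine, not the OP's):

from collections import Counter
from itertools import chain, combinations_with_replacement, product

def pick_unordered(pools, part):
    # The OP's idea: de-duplicate within each repeated pool index via
    # combinations_with_replacement, then take the product across the blocks.
    cnt = Counter()
    for ind in part:
        cnt[ind] += 1
    blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
    return [list(chain(*combo)) for combo in product(*blocks)]

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
combos = pick_unordered(pools, (0, 0, 1))
print(len(combos))  # 18 tuples instead of the 27 above, but cross-pool
                    # duplicates such as [1, 2, 3] and [1, 3, 2] remain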

This reduces ordering duplicates when I happen to choose from the same pool multiple times. However, all the pools have lots of overlap, and using combinations_with_replacement on several pools merged together would generate some invalid combinations. Is there a more efficient method to generate unordered combinations?

Edit: Extra info about the inputs: the number of parts and pools is small (~5 and ~20), and for simplicity each element is an integer. I have already solved the actual problem, so this is just for academic interest. Let's say there are hundreds to thousands of integers in each pool, but some pools are small and only have dozens. So some kind of union or intersection seems to be the way to go.

asked Aug 14 '18 by qwr

4 Answers

This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs) # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t

I don't fear sorting here - it's memory-efficient, and for small tuples is likely faster than all the overheads involved in creating a Counter object.

But regardless of that, the point here is to emphasize the real value the OP got by reformulating the problem to use combinations_with_replacement (cwr). Consider these inputs:

N = 64
pools = [[0, 1]]
ixs = [0] * N

There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical

pools = [[0, 1]] * N
ixs = range(N)

also has the same 65 unique results, but essentially runs forever (as would, e.g., the other answers given so far), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.

So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of that can be won by doing what the OP already did.

It seems to me the most promising approach would be to write a custom generator (not one relying primarily on itertools functions) that generated all possibilities in lexicographic order to begin with (so, by construction, no duplicates would be created to begin with). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time to wrestle with now.

One based on @user2357112's answer

Combining cwr with @user2357112's incremental de-duplicating gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @Joseph Wood's answer approximately as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, found 162295 results):

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result

To make it easier for other Pythonistas to try the last example, here's the input as executable Python:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5,  7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))
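
To run it with the incremental de-duplicating gen() just above (a usage sketch; it returns a set):

result = gen(pools, ixs)
print(len(result))  # 162295, matching the count mentioned above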

However, the OP later added that they typically have about 20 pools, each with some thousands of elements. 1000**20 = 1e60 is waaaaay out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It remains clear as mud how many they expect to be duplicates, though, so also clear as mud whether this kind of "incremental de-duplicating" is good enough to be practical.

Ideally we'd have a generator that yielded one result at a time, in lexicographic order.

Lazy lexicographic one-at-a-time generation

Building on the incremental de-duplication, suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each, and sort each again. Then the derived sequence is still in strictly increasing order. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and sort again) 2 to each:

00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233

They started in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples with t1 < t2, let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (that's what "lexicographic <" means). Now throw x into both tuples and proceed by cases: is x <= t1[i]? strictly between t1[i] and t2[i]? >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less than the second derived tuple.)
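
A quick way to check that claim on the small case above (a verification sketch, not part of the original answer):

from itertools import combinations_with_replacement as cwr

pairs = list(cwr(range(4), 2))                      # the 10 sorted pairs, in order
triples = [tuple(sorted(p + (2,))) for p in pairs]  # append 2, re-sort
assert all(a < b for a, b in zip(triples, triples[1:]))  # still strictly increasing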

So supposing we have a sorted sequence result of all unique sorted tuples from some number of pools, what happens when we add elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also sorted, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed already-sorted sequences can be merged via heapq.merge(), and another generator (killdups() below) run on the merge result to weed out duplicates on the fly. There's no need to, e.g., keep a set of all tuples seen so far. Because the output of the merge is non-decreasing, it's sufficient just to check whether the next result is the same as the last result output.

Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item after all new pool elements have finished with it.

The function build1() (or some workalike) is needed to ensure that the right values are accessed at the right times. For example, if the body of build1() is pasted in inline where it's called, the code will fail spectacularly (the body would access the final values bound to rcopy and new instead of what they were bound to at the time the generator expression was created).

In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).

Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result
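
Because the return value is lazy, results can be pulled one at a time; for example, with the 8-pool input given earlier (a usage sketch):

from itertools import islice

lazy = gen(pools, ixs)
print(list(islice(lazy, 3)))  # the three lexicographically smallest tuples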

2 inputs

Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Otherwise the intersection c of x and y is non-empty, and the answer is the catenation of 4 pieces built from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c). Proof is straightforward but tedious, so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.

Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.

Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.

Here's an implementation:

def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement
    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)
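
A small sanity check (mine, not from the answer): for two overlapping pools, pair() agrees with a brute-force de-duplication of the full product:

from itertools import product

a, b = [1, 2, 3], [2, 3, 4]
got = {tuple(sorted(t)) for t in pair(a, b)}
want = {tuple(sorted(t)) for t in product(a, b)}
assert got == want  # 8 unique unordered pairs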

A Proof-of-Concept Based on Maximal Matching

This blends ideas from @Leon's sketch and an approach @JosephWood sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!

This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it wasn't immediately clear how to accomplish that efficiently. I did have code sitting around to find a match in a bipartite graph, which required only a few changes to use in this context.

So this tries plausible result prefixes in lexicographic order, as in @JosephWood's sketch, and for each sees whether it's actually possible to construct via checking whether a bipartite-graph match exists.

So while the details of @Leon's sketch are largely unimplemented here, the visible behaviors are much the same: it produces results in lexicographic order, it doesn't need to check for duplicates, it's a lazy generator, peak memory use is proportional to the sum of the lengths of the pools, it can obviously be parallelized in many ways (set different processes to work on different regions of the result space), and the key to making it majorly faster lies in reducing the massive amounts of redundant work done by the graph-matching function (a great deal of what it does on each call merely reproduces what it did on the previous call).

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import chain, repeat, islice

    elt2pools = defaultdict(set)
    npools = 0
    for i, count in Counter(ixs).items():
        set_indices = set(range(npools, npools + count))
        for elt in pools[i]:
            elt2pools[elt] |= set_indices
        npools += count
    elt2count = {elt : len(ps) for elt, ps in elt2pools.items()}

    cands = sorted(elt2pools.keys())
    ncands = len(cands)

    result = [None] * npools

    # Is it possible to match result[:n] + [elt]*count?
    # We already know it's possible to match result[:n], but
    # this code doesn't exploit that.
    def match(n, elt, count):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y in y2x:
                        if extend(y2x[y], seen):
                            y2x[y] = x
                            return True
                    else:
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in chain(islice(result, n), repeat(elt, count)):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)
        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return False
        return True

    def inner(i, j): # fill result[j:] with elts from cands[i:]
        if j >= npools:
            yield tuple(result)
            return
        for i in range(i, ncands):
            elt = cands[i]
            # Find the most times `elt` can be added.
            count = min(elt2count[elt], npools - j)
            while count:
                if match(j, elt, count):
                    break
                count -= 1
            # Since it can be added `count` times, it can also
            # be added any number of times less than `count`.
            for k in range(count):
                result[j + k] = elt
            while count:
                yield from inner(i + 1, j + count)
                count -= 1

    return inner(0, 0)
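
A quick usage sketch (not part of the answer): results stream out lazily and in lexicographic order, e.g. for the small (0, 0, 1) example from the question:

from itertools import islice

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
lex = matchgen(pools, (0, 0, 1))
print(list(islice(lex, 5)))  # the five smallest results, no duplicates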

EDIT: note that there's a potential trap here, illustrated by the pair of pools range(10_000) and range(100_000). After producing (9999, 99999), the first position increments to 10000, and then it continues for a very long time deducing that there's no match for any of the possibilities in 10001 .. 99999 in the second position; and then for 10001 in the first position no match for any of the possibilities in 10002 .. 99999 in the second position; and so on. @Leon's scheme instead would have noted that range(10_000) was the only free pool remaining having picked 10000 in the first position, and noted at once then that range(10_000) doesn't contain any values greater than 10000. It would apparently need to do that again for 10001, 10002, ..., 99999 in the first position. That's a linear-time rather than quadratic-time waste of cycles, but a waste all the same. Moral of the story: don't trust anything until you have actual code to try ;-)

And One Based on @Leon's Scheme

Following is a more-or-less faithful implementation of @Leon's ideas. I like the code better than my "proof of concept" (POC) code just above, but was surprised to find that the new code runs significantly slower (by a factor of 3 to 4 on a variety of cases akin to @JosephWood's randomized example) relative to a comparably "optimized" variant of the POC code.

The primary reason appears to be more calls to the matching function. The POC code called that once per "plausible" prefix. The new code doesn't generate any impossible prefixes, but for each prefix it generates may need to make multiple match() calls to determine the possibly smaller set of free pools remaining. Perhaps there's a cleverer way to do that.

Note that I added one twist: if a free pool's elements are all smaller than the last element of the prefix so far, it remains "a free pool" with respect to the prefix, but it's useless because none of its elements can appear in the candidates. This doesn't matter to the outcome, but it means the pool remains in the set of free pools for all remaining recursive calls, which in turn means they can waste time determining that it's still a "free pool". So when a free pool can no longer be used for anything, this version removes it from the set of free pools. This gave a significant speedup.

Note: there are many ways to try matching, some of which have better theoretical O() worst-case behavior. In my experience, simple depth-first (as here) search runs faster in real life on typical cases. But it depends very much on characteristics of what "typical" graphs look like in the application at hand. I haven't tried other ways here.

Bottom lines, ignoring the "2 inputs" special-case code:

  • Nothing here beats incremental de-duplication for speed, if you have the RAM. But nothing is worse than that for peak memory burden.

  • Nothing beats the matching-based approaches for frugal memory burden. They're in an entirely different universe on that measure. They're also the slowest, although at least in the same universe ;-)

The code:

def matchgen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(list)
    allpools = []
    npools = 0
    for i, count in Counter(ixs).items():
        indices = list(range(npools, npools + count))
        plist = sorted(pools[i])
        for elt in plist:
            elt2pools[elt].extend(indices)
        for i in range(count):
            allpools.append(plist)
        npools += count
    pools = allpools
    assert npools == len(pools)

    result = [None] * npools

    # Is it possible to match result[:n] not using pool
    # bady?  If not, return None.  Else return a matching,
    # a dict whose keys are pool indices and whose values
    # are a permutation of result[:n].
    def match(n, bady):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x and y != bady:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        for x in freexs:
            if not extend(x, {bady}):
                return None
        return y2x

    def inner(j, freepools): # fill result[j:]
        from bisect import bisect_left
        if j >= npools:
            yield tuple(result)
            return
        if j:
            new_freepools = set()
            allcands = set()
            exhausted = set()  # free pools with elts too small
            atleast = result[j-1]
            for pi in freepools:
                if pi not in new_freepools:
                    m = match(j, pi)
                    if not m:  # match must use pi
                        continue
                    # Since `m` is a match to result[:j],
                    # any pool in freepools it does _not_
                    # use must still be free.
                    new_freepools |= freepools - m.keys()
                    assert pi in new_freepools
                # pi is free with respect to result[:j].
                pool = pools[pi]
                if pool[-1] < atleast:
                    exhausted.add(pi)
                else:
                    i = bisect_left(pool, atleast)
                    allcands.update(pool[i:])
            if exhausted:
                freepools -= exhausted
                new_freepools -= exhausted
        else: # j == 0
            new_freepools = freepools
            allcands = elt2pools.keys()

        for result[j] in sorted(allcands):
            yield from inner(j + 1, new_freepools)

    return inner(0, set(range(npools)))

Note: this has its own classes of "bad cases". For example, passing 128 copies of [0, 1] consumes about 2 minutes(!) of time on my box to find the 129 results. The POC code takes under a second, while some of the non-matching approaches appear instantaneous.

I won't go into detail about why. Suffice it to say that because all the pools are the same, they all remain "free pools" no matter how deep the recursion goes. match() never has a hard time, always finding a complete match for the prefix in its first (greedy) pass, but even that takes time proportional to the square of the current prefix length (== the current recursion depth).

Pragmatic hybrid

One more here. As noted before, the matching-based approaches suffer some from the expense of graph matching as a fundamental operation repeated so often, and have some unfortunate bad cases pretty easy to stumble into.

Highly similar pools cause the set of free pools to shrink slowly (or not at all). But in that case the pools are so similar that it rarely matters which pool an element is taken from. So the approach below doesn't try to keep exact track of free pools, picks arbitrary pools so long as such are obviously available, and resorts to graph-matching only when it gets stuck. That seems to work well. As an extreme example, the 129 results from 128 [0, 1] pools are delivered in less than a tenth of a second instead of in two minutes. It turns out it never needs to do graph-matching in that case.

Another problem with the POC code (and less so for the other match-based approach) was the possibility of spinning wheels for a long time after the last result was delivered. A pragmatic hack solves that one completely ;-) The last tuple of the sequence is easily computed in advance, and code raises an internal exception to end everything immediately after the last tuple is delivered.

That's it for me! A generalization of the "two inputs" case would remain very interesting to me, but all the itches I got from the other approaches have been scratched now.

def combogen(pools, ixs):
    from collections import Counter
    from collections import defaultdict
    from itertools import islice

    elt2pools = defaultdict(set)
    npools = 0
    cands = []
    MAXTUPLE = []
    for i, count in Counter(ixs).items():
        indices = set(range(npools, npools + count))
        huge = None
        for elt in pools[i]:
            elt2pools[elt] |= indices
            for i in range(count):
                cands.append(elt)
            if huge is None or elt > huge:
                huge = elt
        MAXTUPLE.extend([huge] * count)
        npools += count
    MAXTUPLE = tuple(sorted(MAXTUPLE))
    cands.sort()
    ncands = len(cands)
    ALLPOOLS = set(range(npools))
    availpools = ALLPOOLS.copy()
    result = [None] * npools

    class Finished(Exception):
        pass

    # Is it possible to match result[:n]? If not, return None.  Else
    # return a matching, a dict whose keys are pool indices and
    # whose values are a permutation of result[:n].
    def match(n):

        def extend(x, seen):
            for y in elt2pools[x]:
                if y not in seen:
                    seen.add(y)
                    if y not in y2x or extend(y2x[y], seen):
                        y2x[y] = x
                        return True
            return False

        y2x = {}
        freexs = []
        # A greedy pass first to grab easy matches.
        for x in islice(result, n):
            for y in elt2pools[x]:
                if y not in y2x:
                    y2x[y] = x
                    break
            else:
                freexs.append(x)

        # Now do real work.
        seen = set()
        for x in freexs:
            seen.clear()
            if not extend(x, seen):
                return None
        return y2x

    def inner(i, j):  # fill result[j:] with cands[i:]
        nonlocal availpools
        if j >= npools:
            r = tuple(result)
            yield r
            if r == MAXTUPLE:
                raise Finished
            return
        restore_availpools = None
        last = None
        jp1 = j + 1
        for i in range(i, ncands):
            elt = cands[i]
            if elt == last:
                continue
            last = result[j] = elt
            pools = elt2pools[elt] & availpools
            if pools:
                pool = pools.pop() # pick one - arbitrary
                availpools.remove(pool)
            else:
                # Find _a_ matching, and if that's possible fiddle
                # availpools to pretend that's the one we used all
                # along.
                m = match(jp1)
                if not m: # the prefix can't be extended with elt
                    continue
                if restore_availpools is None:
                    restore_availpools = availpools.copy()
                availpools = ALLPOOLS - m.keys()
                # Find a pool from which elt was taken.
                for pool, v in m.items():
                    if v == elt:
                        break
                else:
                    assert False
            yield from inner(i+1, jp1)
            availpools.add(pool)

        if restore_availpools is not None:
            availpools = restore_availpools

    try:
        yield from inner(0, 0)
    except Finished:
        pass
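
A quick way to check the [0, 1] claim above (a usage sketch):

pools = [[0, 1]] * 128
ixs = range(128)
print(sum(1 for _ in combogen(pools, ixs)))  # 129 results, delivered quickly
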
answered Nov 15 '22 by Tim Peters


This is a difficult problem. I think your best bet in the general case is to implement a hash table where the key is a multiset and the value is your actual combination. This is similar to what @ErikWolf mentioned; however, this method avoids producing duplicates in the first place, so filtering is not required. It also returns correct results when we encounter multisets.

There is a faster solution that I am teasing now, but saving for later. Bear with me.

As mentioned in the comments, one approach that seems viable is to combine all of the pools and simply generate combinations of this combined pool, choosing as many elements as there are pools. You would need a tool capable of generating combinations of multisets; the one I know of that is available in Python is in the sympy library: from sympy.utilities.iterables import multiset_combinations. The problem with this is that we still produce duplicate values and, worse, we produce results that are impossible to obtain with an analogous set-and-product combo. For example, if we were to sort and combine all of the pools from the OP and apply the following:

list(multiset_combinations([1, 2, 2, 3, 3, 3, 4, 4, 5], 3))

A couple of the results would be [1 2 2] and [4 4 5] which are both impossible to obtain from [[1, 2, 3], [2, 3, 4], [3, 4, 5]].

Outside of special cases, I don't see how it is possible to avoid checking every possible product. I hope I am wrong.

Algorithm Overview
The main idea is to map combinations of our product of vectors to unique combinations without having to filter out duplicates. The example given by the OP (i.e. (1, 2, 3) and (1, 3, 2)) should only map to one value (either one of them, as order doesn't matter). We note that the two vectors are identical sets. Now, we also have situations like :

vec1 = (1, 2, 1)
vec2 = (2, 1, 1)
vec3 = (2, 2, 1)

We need vec1 and vec2 to map to the same value whereas vec3 needs to map to its own value. This is the problem with sets as all of these are equivalent sets (with sets, the elements are unique thus {a, b, b} and {a, b} are equivalent).

This is where multisets come into play. With multisets, (2, 2, 1) and (1, 2, 1) are distinct, while (1, 2, 1) and (2, 1, 1) are the same. This is good. We now have a method to generate unique keys.

As I am not a Python programmer, I will proceed in C++.

We will have some issues if we try to implement everything above exactly as is. As far as I know, you can't have std::multiset<int> as the key portion for a std::unordered_map. However, we can for a regular std::map. It isn't as performant as a hash table underneath (it is actually a red-black tree), but it still gives decent performance. Here it is:

// Headers needed to compile this snippet on its own.
#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

void cartestionCombos(std::vector<std::vector<int> > v, bool verbose) {

    std::map<std::multiset<int>, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    std::multiset<int> key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key.clear();

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key.insert(value[k]);
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key.clear();
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key.insert(value[k]);
    }

    cartCombs.insert({key, value});

    if (verbose) {
        int count = 1;

        for (std::pair<std::multiset<int>, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

With test cases of 8 vectors of lengths from 4 to 8 filled with random integers from 1 to 15, the above algorithm runs in about 5 seconds on my computer. That's not bad considering we are looking at nearly 2.5 million total results from our product, but we can do better. But how?

The best performance would come from std::unordered_map with a key that can be built in constant time. Our key above is built in logarithmic time per element (std::multiset insertion), and std::map operations are logarithmic as well. So the question is, how can we overcome these hurdles?

Best Performance

We know we must abandon std::multiset. We need some sort of key that is commutative (insensitive to the order of the elements) while still distinguishing different multisets.

Enter the Fundamental Theorem of Arithmetic

It states that every integer greater than 1 can be represented uniquely (up to the order of the factors) as a product of prime numbers. This is sometimes called the prime decomposition.

So now, we can simply proceed as before, but instead of constructing a multiset, we map each element to a prime number and multiply the results together. This gives us a constant-time (per element) construction for our key. Here is an example showing the power of this technique on the examples we created earlier (N.B. P below is the list of prime numbers 2, 3, 5, 7, 11, ...):

                   Maps to                    Maps to            product
vec1 = (1, 2, 1)    -->>    P[1], P[2], P[1]   --->>   3, 5, 3    -->>    45
vec2 = (2, 1, 1)    -->>    P[2], P[1], P[1]   --->>   5, 3, 3    -->>    45
vec3 = (2, 2, 1)    -->>    P[2], P[2], P[1]   --->>   5, 5, 3    -->>    75

This is awesome!! vec1 and vec2 map to the same number, whereas vec3 gets mapped to a different value just as we wished.

// Headers needed to compile this snippet on its own.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

void cartestionCombosPrimes(std::vector<std::vector<int> > v, 
                        std::vector<int> primes,
                        bool verbose) {

    std::unordered_map<int64_t, std::vector<int> > cartCombs;

    unsigned long int len = v.size();
    unsigned long int myProd = 1;
    std::vector<unsigned long int> s(len);

    for (std::size_t j = 0; j < len; ++j) {
        myProd *= v[j].size();
        s[j] = v[j].size() - 1;
    }

    unsigned long int loopLim = myProd - 1;
    std::vector<std::vector<int> > res(myProd, std::vector<int>());
    std::vector<unsigned long int> myCounter(len, 0);
    std::vector<int> value(len, 0);
    int64_t key;

    for (std::size_t j = 0; j < loopLim; ++j) {
        key = 1;

        for (std::size_t k = 0; k < len; ++k) {
            value[k] = v[k][myCounter[k]];
            key *= primes[value[k]];
        }

        cartCombs.insert({key, value});

        int test = 0;
        while (myCounter[test] == s[test]) {
            myCounter[test] = 0;
            ++test;
        }

        ++myCounter[test];
    }

    key = 1;
    // Get last possible combination
    for (std::size_t k = 0; k < len; ++k) {
        value[k] = v[k][myCounter[k]];
        key *= primes[value[k]];
    }

    cartCombs.insert({key, value});
    std::cout << cartCombs.size() << std::endl;

    if (verbose) {
        int count = 1;

        for (std::pair<int, std::vector<int> > element : cartCombs) {
            std::string tempStr;

            for (std::size_t k = 0; k < len; ++k)
                tempStr += std::to_string(element.second[k]) + ' ';

            std::cout << count << " : " << tempStr << std::endl;
            ++count;
        }
    }
}

On the same example above that would generate nearly 2.5 million products, the above algorithm returns the same result in less than 0.3 seconds.

There are a couple of caveats with this latter method. We must have our primes generated a priori, and if we have many vectors in our Cartesian product, the key could grow beyond the bounds of int64_t. The first issue shouldn't be that difficult to overcome, as there are many resources (libraries, lookup tables, etc.) available for generating prime numbers. I'm not really sure, but I've read that the latter issue shouldn't be a problem for Python, as its integers have arbitrary precision.
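
For illustration, here is a minimal Python sketch of the same prime-key idea (my code, not the author's). It assumes the pool elements are small non-negative integers that can index a prime table directly, which is exactly the assumption the ranking step below removes:

from itertools import product
from math import prod

def unique_by_prime_key(pools):
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]
    seen = {}
    for combo in product(*pools):
        key = prod(primes[v] for v in combo)  # commutative key: order-free
        seen.setdefault(key, combo)           # keep one representative per key
    return list(seen.values())

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
uniq = unique_by_prime_key(pools)
brute = {tuple(sorted(t)) for t in product(*pools)}
assert len(uniq) == len(brute)  # same number of unique multisets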

We also have to deal with the fact that our source vectors might not be nice integer vectors with small values. This can be remedied by ranking all of the elements across all vectors before you proceed. For example, given the following vectors:

vec1 = (12345.65, 5, 5432.11111)
vec2 = (2222.22, 0.000005, 5)
vec3 = (5, 0.5, 0.8)

Ranking them, we would obtain:

rank1 = (6, 3, 5)
rank2 = (4, 0, 3)
rank3 = (3, 1, 2)
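
One way to compute those ranks (a hypothetical sketch in Python; the ranking itself is language-agnostic):

vecs = [(12345.65, 5, 5432.11111), (2222.22, 0.000005, 5), (5, 0.5, 0.8)]
order = {v: r for r, v in enumerate(sorted({x for vec in vecs for x in vec}))}
ranks = [tuple(order[x] for x in vec) for vec in vecs]
print(ranks)  # [(6, 3, 5), (4, 0, 3), (3, 1, 2)]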

And now, these can be used in place of the actual values to create your key. The only portion of the code that would change would be the for loops that build the key (and of course the rank object that would need to be created):

for (std::size_t k = 0; k < len; ++k) {
    value[k] = v[k][myCounter[k]];
    key *= primes[rank[k][myCounter[k]]];
}

Edit:
As some of the commenters have pointed out, the above method disguises the fact that all products must be generated. I should have said that the first time around. Personally, I don't see how it can be avoided given the many different presentations.

Also, in case anybody is curious, here is the test case I used above:

[1 10 14  6],
[7  2  4  8  3 11 12],
[11  3 13  4 15  8  6  5],
[10  1  3  2  9  5  7],
[1  5 10  3  8 14],
[15  3  7 10  4  5  8  6],
[14  9 11 15],
[7  6 13 14 10 11  9  4]

It should return 162295 unique combinations.

answered Nov 15 '22 by Joseph Wood


One way to save some work might be to generate deduplicated combinations of the first k chosen pools, then extend those to deduplicated combinations of the first k+1. This lets you avoid individually generating and rejecting all length-20 combinations that picked 2, 1 instead of 1, 2 from the first two pools:

def combinations_from_pools(pools):
    # 1-element set whose one element is an empty tuple.
    # With no built-in hashable multiset type, sorted tuples are probably the most efficient
    # multiset representation.
    combos = {()}
    for pool in pools:
        combos = {tuple(sorted(combo + (elem,))) for combo in combos for elem in pool}
    return combos
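
For the (0, 0, 1) pick from the question, one would pass the chosen pools explicitly (a usage sketch, not from the answer):

pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
combos = combinations_from_pools([pools[0], pools[0], pools[1]])
print(len(combos))  # 15 unique multisets, versus 27 raw product tuples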

With the input sizes you're talking about, though, no matter how efficiently you generate combinations, you'll never be able to process all of them. Even with 20 identical 1000-element pools, there would be 496432432489450355564471512635900731810050 combinations (1019 choose 20, by the stars and bars formula), or about 5e41. If you conquered the Earth and devoted all the processing power of all the computing equipment of all of humanity to the task, you still couldn't make a dent in that. You need to find a better way to solve your underlying task.

answered Nov 15 '22 by user2357112 supports Monica


Answers that have been posted so far (including lazy lexicographic one-at-a-time generation by Tim Peters) have worst-case space complexity proportional to the size of the output. I am going to outline an approach that will constructively produce all unique unordered combinations without deduplication of internally generated intermediate data. My algorithm generates the combinations in a lexicographically sorted order. It has a computational overhead compared to the simpler algorithms. However it can be parallelized (so that different ranges of the final output can be produced concurrently).

The idea is the following.

So we have N pools {P1, ..., PN} from which we must draw our combinations. We can easily identify the smallest combination (with respect to the lexicographic ordering mentioned above). Let it be (x1, x2, ..., xN-1, xN) (where x1 <= x2 <= ... <= xN-1 <= xN, and each xj is just the smallest element from one of the pools {Pi}). This smallest combination will be followed by zero or more combinations where the prefix x1, x2, ..., xN-1 is the same and the last position runs over an increasing sequence of values. How can we identify that sequence?

Let's introduce the following definition:

Given a combination prefix C=(x1, x2 ..., xK-1, xK) (where K < N), a pool Pi is called free with respect to C if the latter (the prefix) can be drawn from the rest of the pools.

Identifying free pools for a given prefix is easily reduced to the problem of finding maximum matchings in a bipartite graph. The challenging part is doing it efficiently (taking advantage of the specifics of our case). But I will save it for later (this is work in progress, to be materialized as a Python program in a day).

So, for the prefix (x1, x2 ..., xN-1) of our first combination we can identify all the free pools {FPi}. Any one of them can be used to pick an element for the last position. Therefore the sequence of interest is the sorted set of elements from {FP1 U FP2 U ... } that are greater than or equal to xN-1.

When the last position is exhausted we must increase the last-but-one position, whereupon we will repeat the procedure of finding the possible values for the last position. Unsurprisingly, the procedure of enumerating the values for the last-but-one (as well as any other) position is the same - the only difference is the length of the combination prefix based on which free pools must be identified.

Thus the following recursive algorithm should do the work (a small brute-force sketch in Python follows the steps):

  1. Start with an empty combination prefix C. At this point all pools are free.
  2. If the length of C is equal to N, then output C and return.
  3. Merge the free pools into one sorted list S and drop from it all elements that are less than the last element of C.
  4. For each value x from S do
    • The new combination prefix is C'=(C, x)
    • With the current combination prefix having grown by one, some of the free pools stop being free. Identify them and recurse into step 2 with the updated free pool list and combination prefix C'.
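
Below is a tiny, deliberately unoptimized sketch of these steps (my code, not Leon's). The free-pool test is done by brute force over permutations instead of bipartite matching, so it is only usable on very small inputs, but it shows the recursion producing unique results in lexicographic order; a pool that should be drawn from twice is simply listed twice. It is run on the [1, 2], [2, 3], [1, 3] example discussed above:

from itertools import permutations

def leon_sketch(pools):
    n = len(pools)

    def drawable(prefix, pool_ids):
        # Can `prefix` be drawn using each pool in `pool_ids` at most once?
        return any(all(x in pools[p] for x, p in zip(prefix, perm))
                   for perm in permutations(pool_ids, len(prefix)))

    def free_pools(prefix):
        # Pool i is free w.r.t. `prefix` if the prefix fits in the other pools.
        return [i for i in range(n)
                if drawable(prefix, [j for j in range(n) if j != i])]

    def rec(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        last = prefix[-1] if prefix else float("-inf")
        cands = sorted({x for i in free_pools(prefix)
                        for x in pools[i] if x >= last})
        for x in cands:
            yield from rec(prefix + [x])

    return rec([])

print(list(leon_sketch([[1, 2], [2, 3], [1, 3]])))
# [(1, 1, 2), (1, 1, 3), (1, 2, 2), (1, 2, 3), (1, 3, 3), (2, 2, 3), (2, 3, 3)]
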
answered Nov 15 '22 by Leon