Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to split a list into a minimum amount of sets, enumerating all possible solutions

Say I have a list of numbers with duplicants.

import random
lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
random.shuffle(lst)

I want to split the list into a minimum amount of sub"set"s with all unique numbers, without discarding any numbers. I managed to write the following code, but I feel like this is hard-coded, so there should be faster and more general solutions.

from collections import Counter

counter = Counter(lst)
maxcount = counter.most_common(1)[0][1]
res = []
while maxcount > 0:
    res.append(set(x for x in lst if counter[x] >= maxcount))
    maxcount -= 1
assert len([x for st in res for x in st]) == len(lst)
print(res)

Output:

[{4}, {8, 2, 4}, {0, 2, 3, 4, 7, 8}, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]

Obviously, this is only one of the solutions. Another solution could be

[{4, 9}, {8, 2, 4}, {0, 2, 3, 4, 7, 8}, {0, 1, 2, 3, 4, 5, 6, 7, 8}]

I want to find all possible solutions with minimum amount of sub"set"s (4 in this case). Note that same numbers are indistinguishable, e.g. [{1}, {1, 2}] is the same solution as [{1, 2}, {1}] for a list of [1, 2, 1].

Any suggestions?

like image 634
Shaun Han Avatar asked Dec 09 '22 23:12

Shaun Han


2 Answers

This way takes time linear in the number of list elements, and its output is the same (same sets, in the same order) regardless of the input list's order. It's basically a more "eager" variation of your code:

    def split(xs):
        from collections import defaultdict
        x2count = defaultdict(int)
        result = []
        for x in xs:
            x2count[x] += 1
            count = x2count[x]
            if count > len(result):
                result.append(set())
            result[count - 1].add(x)
        return result

Then, e.g.,

    xs = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
    import random
    random.shuffle(xs)
    print(split(xs))

displays

[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, {0, 2, 3, 4, 7, 8}, {8, 2, 4}, {4}]

Finding all answers is bound to be annoying ;-) But straightforward enough. Once you know there are 4 sets in the result, then you have a hairy kind of Cartesian product to compute. You know that, e.g., 7 appears twice, so there are comb(4, 2) == 6 ways to pick the two result sets 7 goes in. For each of those ways, you know, e.g., that 8 appears 3 times, so there are comb(4, 3) == 4 ways to pick the three results sets 8 goes in. Now we're up to 6 * 4 = 24 partial results. Repeat similarly for all other original integers. itertools.combinations() can be used to do the choosing.

Unclear: consider input [1, 1, 2]. The output here is [{1, 2}, {1}]. Do you, or do you not, consider that to be the same as [{1}, {1, 2}]? That is, do you consider an output to be a sequence of sets (in which case they're different), or as a set of sets (in which case they're the same)? A straightforward Cartesian product takes the "it's a sequence" view.

Finding all of 'em

Here's a way. As sketched, it computes a Cartesian product of all ways of distributing each element across the number of required output sets. But rather than use itertools.product() for this, it does it recursively, one element at a time. This allows it to check partial results so far for isomorphism, and decline to extend any partial solution that's isomorphic to a partial solution it already extended.

Toward that end, it views a partial solution as a set of sets. For technical reasons, Python requires using a frozenset for a set that'a going to be used in turn as a set element.

Caution: this generator yields the same result object every time. That's for efficiency. If you don't like that, you could, e.g., replace

            yield result

with

            yield result[:]

instead.

EDIT: note that I replaced the line

            sig = frozenset(map(frozenset, result))

with

            sig = frozenset(Counter(map(frozenset, result)).items())

That's because you really aren't viewing the result as a set of sets, but as a multiset of sets (a given set can appear more than once in a result, and the number of times it appears is significant). In fancier test cases than were given here, that can make a real difference.

A Counter is the closest thing Python has to a builtin multiset type, but there is no "frozen" Counter workalike akin to frozensets. So instead we turn the Counter into a sequence of 2-tuples, and put those tuples into a frozenset. By using (set, count) pairs, this allows us to account for that the number of times a set appears in a result is significant.

def allsplit(xs):
    from collections import Counter
    from itertools import combinations
    c = Counter(xs)
    n = max(c.values())
    result = [set() for i in range(n)]
    pairs = list(c.items())
    pin = len(pairs)

    def inner(pi):
        if pi >= pin:
            yield result
            return
        elt, count = pairs[pi]
        seen = set()
        for ixs in combinations(range(n), count):
            for i in ixs:
                result[i].add(elt)
            sig = frozenset(Counter(map(frozenset, result)).items())
            if sig not in seen:
                yield from inner(pi + 1)
                seen.add(sig)
            for i in ixs:
                result[i].remove(elt)

    return inner(0)

Example:

>>> for x in allsplit([1, 1, 2, 3, 8, 4, 4]):
...     print(x)
    
[{1, 2, 3, 4, 8}, {1, 4}]
[{1, 2, 3, 4}, {8, 1, 4}]
[{1, 2, 4, 8}, {1, 3, 4}]
[{1, 2, 4}, {1, 3, 4, 8}]

For your original example, it finds 36992 unique ways to partition the input.

like image 192
Tim Peters Avatar answered Dec 12 '22 13:12

Tim Peters


I'd suggest using a prefilled list then store each value in separate buckets

import random
from collections import Counter

lst = [0, 0, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 8, 8, 8, 9]
random.shuffle(lst)

c = Counter(lst)
maxcount = c.most_common(1)[0][1]
result = [set() for _ in range(maxcount)]
for k, v in c.items():
    for i in range(v):
        result[i].add(k)

print(result)

Can also be achieved with a defaultdict

c = Counter(lst)
result = defaultdict(set)
for k, v in c.items():
    for i in range(v):
        result[i].add(k)
result = list(result.values())
print(result)

Note on performance

from timeit import timeit
import numpy as np
lst = list(np.random.randint(0, 100, 10000))
nb = 1000
print(timeit(lambda: prefilled_list(lst), number=nb))   # 2.144
print(timeit(lambda: default_dict_set(lst), number=nb)) # 1.903
print(timeit(lambda: op_while_loop(lst), number=nb))    # 318.2
like image 35
azro Avatar answered Dec 12 '22 13:12

azro