Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Deterministic python generator for K disparate M-sized subsets of a set

I'm looking for an efficient way to generate many M-sized subsets from a set S, of size N.

Ideally I would like to generate all of these, but because I'm using them for other computations, this becomes infeasible.

Instead, I would like to generate K disparate subsets of S, such that the K chosen subsets minimize the sum of the size of the all pairwise intersections between the K subsets.

In other words If I have K subsets And I take the pairwise intersection of all of those subsets. And then I sum the size of all of those sets together. I get as low of a number as I can.

Basically, I want these subsets to be as "far away" from each other was possible. I've been trying to think of how I would go about doing this, but I'm drawing a blank.

To simulate it in the meantime I've written this function

        def subset_split(full_set, M, K):
            np.random.seed(0) # repeatibility
            seen = set([])
            subset_list = []
            for kx in xrange(K):
                np.random.shuffle(full_set)
                failsafe = 0
                while True: 
                    np.random.shuffle(full_set)
                    subset = tuple(full_set[0:M])
                    if not subset in seen: 
                        seen.add(subset)
                        subset_list.append(subset)
                        break
                    failsafe += 1
                    if failsafe > 100:
                        break
            return subset_list

which just generates K random subsets that haven't been seen before. But this isn't exactly what I want, because I want those K subsets to be repeatable and to not accidentally be close to each if they don't have to be.

like image 978
Erotemic Avatar asked Sep 11 '13 02:09

Erotemic


1 Answers

Well, I'm still having fun with this ;-)

Where this left off, it became clear that we're trying to minimize:

sum(n*(n-1)//2 for n in index2count.values())

That's minimal if all the values are the same (when possible), or there are two distinct values one apart (like, if len(full_set) = 10, seven 3's and three 4's). That's enough to write a generator that doesn't bother to compute index2count at all. Instead hi is the set of indices with the larger of the two values, and lo the rest of the indices (lo is empty if all the (conceptual! not computed) values are the same).

This drops the K argument, and takes a different approach to duplicates: it ignores them. It's expensive and clumsy to keep track of duplicates here, and duplicates should really be expected in a "random-ish" generator. If that bothers you, wrap it in another routine to weed out duplicates.

Sample output:

>>> from itertools import islice
>>> for s in islice(gen_subsets_special(set(range(5)), 1), 10):
...    print s
set([4])
set([3])
set([0])
set([1])
set([2])
set([0])
set([3])
set([1])
set([2])
set([4])

>>> for s in islice(gen_subsets_special(set(range(10, 20)), 3), 10):
...    print s
set([17, 18, 10])
set([11, 12, 14])
set([16, 19, 13])
set([12, 13, 15])
set([17, 10, 11])
set([16, 19, 15])
set([17, 18, 14])
set([16, 18, 13])
set([19, 12, 15])
set([10, 11, 14])

Here's the code:

def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    N = len(elements)
    hi = set(range(N))
    lo = set()
    while True:
        assert not (hi & lo)
        assert len(lo | hi) == N
        # First take indices from hi, then (if needed) from lo.
        if len(hi) > M:
            # We can take all of them from hi, with some left over.
            ixhi = set(sample(hi, M))
            ixlo = set()
            # The ixhi counts go down by 1, so move 'em to lo
            hi -= ixhi
            lo |= ixhi
            assert hi
        else:
            # We need to take everything in hi.
            ixhi = hi.copy()
            ixlo = set(sample(lo, M - len(ixhi)))
            hi |= lo - ixlo
            lo = ixlo
        assert not (ixlo & ixhi)
        ix = ixlo | ixhi
        assert len(ix) == M
        yield set(elements[i] for i in ix)

By construction, every prefix of the generated sequence minimizes the sum of the sizes of the intersections of all pairs of sets in the prefix. Or so it seems to me right now ;-)

Finally, a variation that more obviously just repeatedly cycles thru all the indices. This is probably the one I'd actually use:

def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    allix = set(range(len(elements)))
    takefrom = allix.copy()

    def destructive_sample(n):
        # Remove a random n-subset from takefrom, & return it.
        s = set(sample(takefrom, n))
        takefrom.difference_update(s)
        return s

    while True:
        if len(takefrom) >= M:
            # Get everything from takefrom.
            ix = destructive_sample(M)
        else:
            # We need to take all of takefrom, and more.
            ix = takefrom
            takefrom = allix - ix
            ix |= destructive_sample(M - len(ix))
        assert len(ix) == M
        yield set(elements[i] for i in ix)
like image 162
Tim Peters Avatar answered Sep 21 '22 04:09

Tim Peters