I'm looking for an efficient way to generate many M-sized subsets of a set S of size N.
Ideally I would like to generate all of them, but because each subset feeds further computation, that becomes infeasible.
Instead, I would like to generate K disparate subsets of S, chosen so as to minimize the sum of the sizes of all pairwise intersections between the K subsets.
In other words: if I take every pairwise intersection of the K subsets and sum the sizes of all of those intersections, that total should be as low as possible.
Basically, I want these subsets to be as "far away" from each other as possible. I've been trying to think of how I would go about doing this, but I'm drawing a blank.
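For concreteness, the quantity I want to minimize can be computed like this (total_overlap is just a name I'm making up for illustration):

from itertools import combinations

def total_overlap(subsets):
    # Sum of |A & B| over all unordered pairs -- the score to minimize.
    return sum(len(a & b) for a, b in combinations(subsets, 2))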
In the meantime, to simulate it, I've written this function:
import numpy as np

def subset_split(full_set, M, K):
    np.random.seed(0)  # repeatability
    seen = set()
    subset_list = []
    for kx in range(K):
        failsafe = 0
        while True:
            np.random.shuffle(full_set)
            # sort so the same M elements hash identically regardless of order
            subset = tuple(sorted(full_set[0:M]))
            if subset not in seen:
                seen.add(subset)
                subset_list.append(subset)
                break
            failsafe += 1
            if failsafe > 100:
                # give up on this k rather than loop forever
                break
    return subset_list
which just generates K random subsets that haven't been seen before. But this isn't exactly what I want, because I want those K subsets to be repeatable, and to not accidentally be close to each other when they don't have to be.
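For example, scoring its output with the total_overlap helper from above:

subsets = [set(t) for t in subset_split(list(range(10)), 3, 5)]
print(total_overlap(subsets))  # random picks usually overlap more than they have to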
Well, I'm still having fun with this ;-)
Where this left off, it became clear that we're trying to minimize:
sum(n*(n-1)//2 for n in index2count.values())
(each element that ends up in n of the generated subsets contributes n*(n-1)//2 to the summed pairwise intersection sizes, one for each pair of subsets containing it).
That's minimal if all the values are the same (when possible), or there are two distinct values one apart (e.g., if len(full_set) = 10, seven 3's and three 4's). That's enough to write a generator that doesn't bother to compute index2count at all. Instead hi is the set of indices with the larger of the two values, and lo the rest of the indices (lo is empty if all the (conceptual! not computed) values are the same).
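A quick numeric check of that claim (pairs is just an illustrative name; both count vectors below sum to 33, as with eleven 3-subsets of a 10-element set):

def pairs(counts):
    # An element hit n times contributes n*(n-1)//2 pairwise overlaps.
    return sum(n * (n - 1) // 2 for n in counts)

print(pairs([3] * 7 + [4] * 3))   # balanced: 39
print(pairs([0] * 7 + [11] * 3))  # lopsided, same total: 165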
This drops the K argument, and takes a different approach to duplicates: it ignores them. It's expensive and clumsy to keep track of duplicates here, and duplicates should really be expected in a "random-ish" generator. If that bothers you, wrap it in another routine to weed out duplicates (a sketch of such a wrapper follows the final version below).
Sample output:
>>> from itertools import islice
>>> for s in islice(gen_subsets_special(set(range(5)), 1), 10):
... print s
set([4])
set([3])
set([0])
set([1])
set([2])
set([0])
set([3])
set([1])
set([2])
set([4])
>>> for s in islice(gen_subsets_special(set(range(10, 20)), 3), 10):
... print s
set([17, 18, 10])
set([11, 12, 14])
set([16, 19, 13])
set([12, 13, 15])
set([17, 10, 11])
set([16, 19, 15])
set([17, 18, 14])
set([16, 18, 13])
set([19, 12, 15])
set([10, 11, 14])
Here's the code:
def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    N = len(elements)
    hi = set(range(N))
    lo = set()
    while True:
        assert not (hi & lo)
        assert len(lo | hi) == N
        # First take indices from hi, then (if needed) from lo.
        if len(hi) > M:
            # We can take all of them from hi, with some left over.
            ixhi = set(sample(hi, M))
            ixlo = set()
            # The ixhi counts go down by 1, so move 'em to lo.
            hi -= ixhi
            lo |= ixhi
            assert hi
        else:
            # We need to take everything in hi.
            ixhi = hi.copy()
            ixlo = set(sample(lo, M - len(ixhi)))
            hi |= lo - ixlo
            lo = ixlo
            assert not (ixlo & ixhi)
        ix = ixlo | ixhi
        assert len(ix) == M
        yield set(elements[i] for i in ix)
By construction, every prefix of the generated sequence minimizes the sum of the sizes of the intersections of all pairs of sets in the prefix. Or so it seems to me right now ;-)
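A quick empirical check of that claim, reusing the total_overlap scorer sketched in the question (a sanity test, not a proof):

from itertools import islice
from random import sample, seed

prefix = list(islice(gen_subsets_special(set(range(10, 20)), 3), 10))
seed(1)
baseline = [set(sample(range(10, 20), 3)) for _ in range(10)]
print(total_overlap(prefix), total_overlap(baseline))  # prefix should score no higher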
Finally, a variation that more obviously just repeatedly cycles thru all the indices. This is probably the one I'd actually use:
def gen_subsets_special(full_set, M, seed=123456):
    # generate randomish M-subsets of full_set, "far apart".
    import random
    from random import sample
    random.seed(seed)
    elements = list(full_set)
    allix = set(range(len(elements)))
    takefrom = allix.copy()

    def destructive_sample(n):
        # Remove a random n-subset from takefrom, & return it.
        s = set(sample(takefrom, n))
        takefrom.difference_update(s)
        return s

    while True:
        if len(takefrom) >= M:
            # Get everything from takefrom.
            ix = destructive_sample(M)
        else:
            # We need to take all of takefrom, and more.
            ix = takefrom
            takefrom = allix - ix
            ix |= destructive_sample(M - len(ix))
        assert len(ix) == M
        yield set(elements[i] for i in ix)
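And, as mentioned above, if duplicates are unwelcome, a thin wrapper can weed them out. A sketch (unique_subsets is a made-up name; note it spins forever once every distinct M-subset has been seen):

def unique_subsets(full_set, M, seed=123456):
    # Yield each distinct subset from gen_subsets_special only once.
    seen = set()
    for s in gen_subsets_special(full_set, M, seed):
        key = frozenset(s)
        if key not in seen:
            seen.add(key)
            yield s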