
Python: How to obtain a random subset

How would I get a random subset of a set s in Python? I tried:

from random import sample, randint

def random_subset(s):
    length = randint(0, len(s))
    return set(sample(s, length))

But I now realize that this doesn't work: the size of a uniformly random subset of s is not uniformly distributed from 0 to n, where n = len(s).
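To make the bias concrete, here is a small sketch that computes exactly how likely the length-first approach is to return one particular subset of size k, compared with the 1/2^n a uniform choice would give. The helper name flawed_subset_probability is only illustrative, and math.comb assumes Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def flawed_subset_probability(n, k):
    """Chance that the length-first method returns one specific subset of size k."""
    # The length k is picked with probability 1/(n+1); sample() then picks
    # one of the comb(n, k) subsets of that size uniformly.
    return Fraction(1, n + 1) / comb(n, k)

n = 3
uniform = Fraction(1, 2 ** n)  # what a uniform choice over all subsets would give
for k in range(n + 1):
    print(k, flawed_subset_probability(n, k), uniform)
```

For n = 3 the empty set and the full set each come out with probability 1/4 instead of 1/8, so the extreme sizes are over-represented.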

I'm sure I could compute that distribution and use numpy's weighted sampling, or something like that, but I'd prefer a pure-Python solution.

Enrico Borba asked Feb 19 '19 03:02


2 Answers

I just realized I can simply go through each element in s and decide independently whether to keep it. Something like this:

from random import randint

def random_subset(s):
    out = set()
    for el in s:
        # random coin flip
        if randint(0, 1) == 0:
            out.add(el)
    return out

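This has the correct distribution: each element is kept independently with probability 1/2, so each of the 2^n subsets of an n-element set is returned with probability 1/2^n.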
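A rough empirical check of that claim, as a sketch: draw many subsets of a three-element set and count how often each of the 8 possible subsets appears. The helper subset_frequencies and the trial count are assumptions made for illustration.

```python
import random
from collections import Counter

def random_subset(s):
    # Keep each element independently with probability 1/2.
    return {el for el in s if random.randint(0, 1) == 0}

def subset_frequencies(s, trials=8000, seed=0):
    """Count how often each subset of s shows up over many draws."""
    random.seed(seed)
    return Counter(frozenset(random_subset(s)) for _ in range(trials))

counts = subset_frequencies({"a", "b", "c"})
for subset, count in sorted(counts.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), count)
```

Each of the 8 subsets should show up roughly trials / 8 times, regardless of its size.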

Enrico Borba answered Sep 21 '22 21:09


What subset you obtain will depend largely on the criterion you specify for including or excluding elements. If you have a function criterion that accepts an element and returns a Boolean to indicate inclusion in the subset, the actual creation process becomes simply

from random import randrange

def random_subset(s, criterion=lambda x: randrange(2)):
    return set(filter(criterion, s))

filter returns a lazy iterator, so the returned set is the only place the selection gets stored. The default criterion keeps each element independently with probability 1/2, which makes every subset equally likely. randrange is similar to randint except that its upper bound is exclusive. At least as of Python 3.2+, both functions produce fairly uniform results regardless of range size.
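The laziness of filter can be seen with a small sketch. The criterion here is deterministic and records its calls, purely for illustration; noisy_criterion is a made-up name.

```python
calls = []

def noisy_criterion(x):
    # Record every element the criterion is asked about.
    calls.append(x)
    return x % 2 == 0

lazy = filter(noisy_criterion, [1, 2, 3, 4])
calls_before = len(calls)  # nothing has been evaluated yet
result = set(lazy)         # consuming the iterator runs the criterion
print(calls_before, len(calls), result)
```

The criterion is only invoked once set() starts consuming the iterator, so no intermediate list is ever built.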

You can further refine the criterion by using random:

from random import random

criterion = lambda x: random() < 0.5

Applying a threshold like that may seem like overkill, but it lets you adjust the distribution. You can have a function that generates criteria for whatever threshold you like:

def make_criterion(threshold=0.5):
    return lambda x: random() < threshold

You could use it to get a subset that is smaller on average:

random_subset(s, make_criterion(0.1))
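As a sketch of what the threshold does to the subset size: with threshold p, each element is kept with probability p, so the expected size is p * len(s). The set of 10,000 integers below is just an assumed test input.

```python
from random import random

def make_criterion(threshold=0.5):
    # Each call to the returned function succeeds with probability `threshold`.
    return lambda x: random() < threshold

def random_subset(s, criterion):
    return set(filter(criterion, s))

s = set(range(10_000))
small = random_subset(s, make_criterion(0.1))
print(len(small))  # typically close to 0.1 * len(s)
```

Note that the threshold controls the expected size only; the actual size still varies from run to run.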

In fact, you can make the criterion as complicated as you would like. The following example is a contrived callable class that operates on sets of strings. If a string with the same first character has already been seen (whether or not it was kept), it automatically rejects the current element. If the second character has been seen already, it sets the probability of inclusion to 0.25. Otherwise, it flips a coin:

class WeirdCriterion:

    def __init__(self):
        self.first = set()
        self.second = set()

    def __call__(self, x):
        n = len(x)
        if n > 0:
            if x[0] in self.first:
                return False
            self.first.add(x[0])
            if n > 1:
                if x[1] in self.second:
                    return not randrange(4)
                self.second.add(x[1])
        return randrange(2)

This example wouldn't be very good in practice because sets are unordered, and can give different iteration orders between different runs of the same script. What it shows, however, is a method for creating a criterion that is random, but is adjusted in response to elements that are already in the subset.
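One possible way around the ordering problem, sketched under assumptions: iterate over sorted(s) and give the criterion its own seeded random.Random instance, so that the same seed yields the same subset. FirstLetterOnce is a hypothetical, simplified criterion, not the class above.

```python
import random

class FirstLetterOnce:
    """Hypothetical stateful criterion: at most one word per initial letter."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # private generator, independent of global state
        self.seen = set()

    def __call__(self, word):
        if not word or word[0] in self.seen:
            return False
        self.seen.add(word[0])
        return self.rng.random() < 0.5  # coin flip for the first word of each letter

def reproducible_subset(s, seed):
    # Sorting fixes the iteration order, so the stateful criterion
    # sees the elements in the same sequence on every run.
    return set(filter(FirstLetterOnce(seed), sorted(s)))

words = {"apple", "avocado", "banana", "blueberry", "cherry"}
first = reproducible_subset(words, seed=42)
second = reproducible_subset(words, seed=42)
print(first == second)
```

Sorting costs O(n log n) and requires comparable elements, which may or may not be acceptable for a given use.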

Avoiding Numpy

Now that I have a better understanding of your original intent: you can leverage the fact that Python 3 has arbitrary-precision integers and that choices accepts a weights parameter to draw the subset length from the correct distribution. I don't recommend this approach though:

from random import choices, sample
from math import factorial

def random_subset(s):
    items = list(s)  # sample() needs a sequence; sets are rejected since Python 3.11
    n = len(items)
    nf = factorial(n)
    # yes, there are better ways of doing this, even in pure python
    # integer division keeps the binomial coefficients exact
    weights = [nf // (factorial(k) * factorial(n - k)) for k in range(n + 1)]
    length = choices(range(n + 1), weights, k=1)[0]
    return set(sample(items, length))

A better solution for computing the binomial coefficients could be something like:

def pascal(n):
    result = [1] * (n + 1)
    if n < 2:
        return result
    for i in range(2, n + 1):
        for j in range(i - 1, 0, -1):
            result[j] += result[j - 1]
    return result
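On Python 3.8 or later, math.comb gives the same row of coefficients directly. A minimal sketch of the length-first approach built on it follows; binomial_row and random_subset_by_size are illustrative names, not part of the answer above.

```python
from math import comb
from random import choices, sample

def binomial_row(n):
    # Row n of Pascal's triangle: the number of subsets of each size 0..n.
    return [comb(n, k) for k in range(n + 1)]

def random_subset_by_size(s):
    items = list(s)  # sample() requires a sequence, not a set
    n = len(items)
    # Pick the size with probability proportional to the number of subsets of that size.
    size = choices(range(n + 1), weights=binomial_row(n), k=1)[0]
    return set(sample(items, size))

print(binomial_row(4))
print(random_subset_by_size({"a", "b", "c", "d"}))
```

choices converts the weight total to a float internally, so very large n can overflow; the per-element coin flip does not have that limitation.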
Mad Physicist answered Sep 18 '22 21:09