Python: How to obtain a random subset

How would I get a random subset of a set s in Python? I tried:

from random import sample, randint

def random_subset(s):
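    length = randint(0, len(s))  # uniform over 0..n, which turns out to be the flaw
    return set(sample(list(s), length))  # list(): sample() rejects sets since Python 3.11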

But I now realize that this doesn't work: the size of a uniformly random subset of an n-element set is not uniformly distributed over 0 to n. There are C(n, k) subsets of size k, so sizes near n/2 are far more likely than sizes near 0 or n.
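To make the problem concrete, one can count the subsets of each size for a small set. This is only a sketch; the helper name `subset_counts` and the 4-element example set are illustrative and not part of the original question.

```python
from itertools import combinations

def subset_counts(s):
    """Return a list whose k-th entry is the number of subsets of s with size k."""
    items = list(s)
    return [sum(1 for _ in combinations(items, k)) for k in range(len(items) + 1)]

counts = subset_counts({"a", "b", "c", "d"})
total = sum(counts)  # 2**4 = 16 subsets in all

# Probability that a uniformly chosen subset has size k
probabilities = [c / total for c in counts]

print(counts)         # [1, 4, 6, 4, 1]
print(probabilities)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

Picking the length uniformly would give each size probability 1/5 here, which over-represents the empty set and the full set.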

I'm sure I could compute that distribution and use NumPy's weighted sampling, or something like that, but I'd prefer a pure-Python solution.

411

asked Feb 19 '19 03:02

2 Answers

I just realized I can simply go through each element in s and decide independently whether to keep it. Something like this:

from random import randint

def random_subset(s):
    out = set()
    for el in s:
        # random coin flip
        if randint(0, 1) == 0:
            out.add(el)
    return out

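This has the correct distribution: each element is kept independently with probability 1/2, so every one of the 2^n subsets is produced with probability 1/2^n.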
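A quick empirical check of that claim, as a rough sketch rather than a proof. The `rng` parameter and the seed are assumptions added here so the experiment is repeatable; the answer's function uses the module-level `randint` instead.

```python
from collections import Counter
from random import Random

def random_subset(s, rng):
    """Keep each element independently with probability 1/2."""
    return {el for el in s if rng.randint(0, 1) == 0}

rng = Random(12345)  # seeded only to make the experiment repeatable
s = {1, 2, 3}
trials = 80_000
counts = Counter(frozenset(random_subset(s, rng)) for _ in range(trials))

# With 3 elements there are 2**3 = 8 subsets, so each one should show up
# in roughly 1/8 of the trials.
for subset, count in sorted(counts.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), round(count / trials, 3))
```

Every printed frequency should be near 0.125, including those of the empty set and the full set.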

147

answered Sep 21 '22 21:09

Which subset you obtain depends on the criterion you specify for including or excluding elements. If you have a function criterion that accepts an element and returns a Boolean indicating whether to include it, creating the subset becomes simply:

from random import randrange

def random_subset(s, criterion=lambda x: randrange(2)):
    return set(filter(criterion, s))

filter returns a lazy iterator, so the returned set is the only place the selection gets stored. The default criterion is very simple: it keeps each element with probability 1/2, which makes every subset equally likely. randrange is similar to randint except that its right bound is exclusive. As of Python 3.2, both functions produce fairly uniform results regardless of range size.
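The difference in bounds can be checked directly. This is a small sketch; the variable names are illustrative.

```python
from random import randint, randrange

# randrange(2) draws from {0, 1}: the upper bound 2 is excluded.
# randint(0, 1) draws from {0, 1}: both bounds are included.
draws_randrange = {randrange(2) for _ in range(1000)}
draws_randint = {randint(0, 1) for _ in range(1000)}

# filter() treats 0 as False and 1 as True, so either call works as a coin flip.
kept = set(filter(lambda x: randrange(2), range(10)))
print(draws_randrange, draws_randint, kept)
```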

You can refine the criterion further by comparing random() against a threshold:

from random import random

criterion = lambda x: random() < 0.5

Applying a threshold like that may seem like overkill, but it lets you adjust the distribution. You can have a function that generates criteria for whatever threshold you like:

def make_criterion(threshold=0.5):
    return lambda x: random() < threshold

You could use it to get a subset that is smaller on average, keeping each element with probability 0.1:

random_subset(s, make_criterion(0.1))
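To see the effect of the threshold, the sketch below measures the average subset size over repeated draws. The seeded `rng`, the 1000-element set, and the number of repetitions are assumptions chosen for illustration.

```python
from random import Random

rng = Random(2019)  # hypothetical seeded generator, used only for repeatability

def make_criterion(threshold=0.5):
    # Each element is kept with probability `threshold`
    return lambda x: rng.random() < threshold

def random_subset(s, criterion):
    return set(filter(criterion, s))

s = set(range(1000))
sizes = [len(random_subset(s, make_criterion(0.1))) for _ in range(200)]
average_size = sum(sizes) / len(sizes)
print(average_size)  # expected to land near 0.1 * 1000 = 100
```

The subset size follows a binomial distribution with mean threshold * len(s), so a threshold other than 0.5 no longer makes all subsets equally likely.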

In fact, you can make the criterion as complicated as you like. The following example is a contrived callable class that operates on sets of strings. If a string with the same first character has already been seen, it automatically rejects the current element. If the second character has been seen already, it sets the probability of inclusion to 0.25. Otherwise, it flips a coin:

class WeirdCriterion:

    def __init__(self):
        self.first = set()
        self.second = set()

    def __call__(self, x):
        n = len(x)
        if n > 0:
            if x[0] in self.first:
                return False
            self.first.add(x[0])
            if n > 1:
                if x[1] in self.second:
                    return not randrange(4)
                self.second.add(x[1])
        return randrange(2)

This example wouldn't be very good in practice because sets are unordered and can give different iteration orders between different runs of the same script. What it shows, however, is a method for creating a criterion that is random but adjusts in response to the elements it has already encountered.
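One way to sidestep the ordering problem, assuming the elements can be sorted, is to fix the iteration order before filtering. The sketch below uses a simplified, hypothetical criterion (`FirstLetterOnce`) and a hypothetical helper (`random_subset_sorted`); neither comes from the answer above.

```python
from random import Random

class FirstLetterOnce:
    """Hypothetical stateful criterion: reject a string whose first
    character was already seen, otherwise flip a coin."""

    def __init__(self, rng):
        self.rng = rng
        self.seen = set()

    def __call__(self, x):
        if x[:1] in self.seen:
            return False
        self.seen.add(x[:1])
        return self.rng.randrange(2) == 1

def random_subset_sorted(s, criterion):
    # sorted() fixes the iteration order, so the criterion sees the
    # elements in the same sequence on every run
    return set(filter(criterion, sorted(s)))

words = {"apple", "avocado", "banana", "blueberry", "cherry"}
first = random_subset_sorted(words, FirstLetterOnce(Random(7)))
second = random_subset_sorted(words, FirstLetterOnce(Random(7)))
print(first == second)  # True: same seed and same order give the same subset
```

Sorting costs O(n log n) and biases a stateful criterion toward elements that sort early, so this only trades one quirk for reproducibility.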

Avoiding NumPy

Now that I have a better understanding of your original intent: you can leverage the fact that Python 3 has arbitrary-precision integers and that choices accepts a weights parameter to draw the subset length from the correct distribution. I don't recommend this approach, though:

from random import choices, sample
from math import factorial

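def random_subset(s):
    items = list(s)  # sample() requires a sequence (sets are rejected since Python 3.11)
    n = len(items)
    nf = factorial(n)
    # binomial coefficients C(n, k); integer division keeps them exact
    # (there are better ways of computing these, even in pure Python)
    weights = [nf // (factorial(k) * factorial(n - k)) for k in range(n + 1)]
    length = choices(range(n + 1), weights, k=1)[0]
    return set(sample(items, length))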

A better solution for computing the binomial coefficients could be something like:

def pascal(n):
    result = [1] * (n + 1)
    if n < 2:
        return result
    for i in range(2, n + 1):
        for j in range(i - 1, 0, -1):
            result[j] += result[j - 1]
    return result
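For completeness, here is a sketch of how pascal might replace the factorial-based weights. It repeats the function so the block runs on its own; the `items` variable and the 5-element example set are illustrative.

```python
from random import choices, sample

def pascal(n):
    """Row n of Pascal's triangle, using the same in-place update as above."""
    result = [1] * (n + 1)
    for i in range(2, n + 1):
        for j in range(i - 1, 0, -1):
            result[j] += result[j - 1]
    return result

def random_subset(s):
    items = list(s)  # sample() needs a sequence; sets are rejected since Python 3.11
    weights = pascal(len(items))  # weights[k] = number of subsets of size k
    length = choices(range(len(items) + 1), weights, k=1)[0]
    return set(sample(items, length))

row = pascal(5)
subset = random_subset({"a", "b", "c", "d", "e"})
print(row)  # [1, 5, 10, 10, 5, 1]
print(subset)
```

The per-element coin flip from the other answer produces the same distribution with far less work, which is why this approach is not recommended.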

answered Sep 18 '22 21:09

Mad Physicist


Tags: python, python-3.x, set, subset

Asked by Enrico Borba