 

Weighted random sample in Python

I'm looking for a reasonable definition of a function weighted_sample that does not return just one random index for a list of given weights (which would be something like

import random

def weighted_choice(weights, random=random):
    """ Given a list of weights [w_0, w_1, ..., w_n-1],
        return an index i in range(n) with probability proportional to w_i. """
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        if w < 0:
            raise ValueError("Negative weight encountered.")
        rnd -= w
        if rnd < 0:
            return i
    raise ValueError("Sum of weights is not positive")

to draw a single index from the fixed categorical distribution given by the weights), but rather a random sample of k of those indices, without replacement, just as random.sample relates to random.choice.
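For example (just a quick sanity check with made-up weights), with weights [1, 10, 2] the index 1 should come back about 10/13 of the time:

from collections import Counter
# index 1 should dominate, roughly 10/13 of the draws
print(Counter(weighted_choice([1, 10, 2]) for _ in range(10000)))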

Just as weighted_choice can be written as

lambda weights: random.choice([val for val, cnt in enumerate(weights)
    for i in range(cnt)])

weighted_sample could be written as

lambda weights, k: random.sample([val for val, cnt in enumerate(weights)
    for i in range(cnt)], k)

but I would like a solution that does not require me to unravel the weights into a (possibly huge) list.

Edit: If there are any nice algorithms that give me back a histogram/list of frequencies (in the same format as the argument weights) instead of a sequence of indices, that would also be very useful.
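For concreteness, converting a sequence of sampled indices into that frequency format could look like this (just a sketch using collections.Counter, not an answer to the question itself):

from collections import Counter

def indices_to_histogram(indices, n):
    # frequency list aligned with the weights argument:
    # entry i counts how often index i was sampled
    counts = Counter(indices)
    return [counts[i] for i in range(n)]

# indices_to_histogram([1, 1, 2, 1], 3) == [0, 3, 1]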

asked Oct 24 '12 by Anaphory


2 Answers

From your code: ..

weight_sample_indexes = lambda weights, k: random.sample([val 
        for val, cnt in enumerate(weights) for i in range(cnt)], k)

.. I assume that weights are positive integers and by "without replacement" you mean without replacement for the unraveled sequence.

Here's a solution based on random.sample and O(log n) __getitem__:

import bisect
import random
from collections import Counter
from collections.abc import Sequence  # Sequence lives in collections.abc on Python 3.3+

def weighted_sample(population, weights, k):
    return random.sample(WeightedPopulation(population, weights), k)

class WeightedPopulation(Sequence):
    """Behaves like the unraveled sequence (each item repeated `weight`
    times) without ever materializing it."""
    def __init__(self, population, weights):
        assert len(population) == len(weights) > 0
        self.population = population
        self.cumweights = []
        cumsum = 0  # compute cumulative weights
        for w in weights:
            cumsum += w
            self.cumweights.append(cumsum)
    def __len__(self):
        return self.cumweights[-1]
    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        # find which item the i-th slot of the virtual sequence belongs to
        return self.population[bisect.bisect(self.cumweights, i)]

Example

total = Counter()
for _ in range(1000):
    sample = weighted_sample("abc", [1,10,2], 5)
    total.update(sample)
print(sample)
print("Frequences %s" % (dict(Counter(sample)),))

# Check that values are sane
print("Total " + ', '.join("%s: %.0f" % (val, count * 1.0 / min(total.values()))
                           for val, count in total.most_common()))

Output

['b', 'b', 'b', 'c', 'c']
Frequencies {'c': 2, 'b': 3}
Total b: 10, c: 2, a: 1
answered Oct 09 '22 by jfs


What you want to create is a non-uniform random distribution. One bad way of doing this is to create a giant array with output symbols in proportion to the weights. So if a is 5 times more likely than b, you create an array with 5 times more a's than b's. This works fine for simple distributions where the weights are even multiples of each other, but what if you wanted 99.99% a and 0.01% b? You'd have to create 10,000 slots.
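For illustration, a sketch of that giant-array approach in Python (assuming integer weights):

import random

def naive_choice(symbols, weights):
    # build the giant lookup table: each symbol appears `weight` times,
    # so memory grows with sum(weights)
    table = [s for s, w in zip(symbols, weights) for _ in range(w)]
    return random.choice(table)

# 99.99% a / 0.01% b needs a 10000-entry table:
# naive_choice("ab", [9999, 1])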

There is a better way. Any non-uniform distribution over N symbols can be decomposed into a series of N-1 binary distributions, each of which is equally likely to be chosen.

So if you had such a decomposition, you'd first choose a binary distribution at random by generating a uniform random integer from 1 to N-1:

u32 dist = randInRange( 1, N-1 ); // pick one of the N-1 binary distributions uniformly

Then say the chosen distribution is a binary distribution over two symbols a and b, where a uniform draw below alpha yields a and a draw above alpha yields b:

float f = randomFloat();
return ( f > alpha ) ? b : a;

How to decompose an arbitrary non-uniform distribution is a little more involved. Essentially you create N-1 'buckets'. Take the symbol with the lowest probability and the one with the highest probability, and distribute their weights into the first binary distribution: the smallest symbol contributes all of its probability, and the largest symbol fills the remainder of the bucket. Then delete the smallest symbol, subtract from the largest symbol the amount of weight that went into this bucket, and repeat the process until no symbols are left.
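Here is a minimal Python sketch of that decomposition and of drawing from it (my own illustration of the idea above, assuming at least two symbols; floating-point round-off is ignored):

import random

def decompose(weights):
    """Split a distribution over n >= 2 symbols into n-1 equally likely
    binary 'buckets'. Each bucket is (small, large, alpha): with
    probability alpha return small, otherwise large."""
    n = len(weights)
    total = float(sum(weights))
    mass = {i: w / total for i, w in enumerate(weights)}  # remaining probability per symbol
    bucket = 1.0 / (n - 1)                                # probability mass held by each bucket
    buckets = []
    for _ in range(n - 1):
        small = min(mass, key=mass.get)
        large = max((s for s in mass if s != small), key=mass.get)
        alpha = mass[small] / bucket           # small's share of this bucket
        buckets.append((small, large, alpha))
        mass[large] -= bucket - mass[small]    # large fills the rest of the bucket
        del mass[small]
    return buckets

def draw(buckets):
    small, large, alpha = random.choice(buckets)  # pick one of the n-1 buckets uniformly
    return small if random.random() < alpha else large

# Example: weights [1, 10, 2] give indices 0, 1, 2 with probabilities 1/13, 10/13, 2/13
buckets = decompose([1, 10, 2])
counts = [0, 0, 0]
for _ in range(10000):
    counts[draw(buckets)] += 1
print(counts)  # roughly proportional to [1, 10, 2]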

I can post C++ code for this if you want to go with this solution.

answered Oct 09 '22 by Rafael Baptista