Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently sampling from a multiset (Counter) in Python

Tags:

python

Annoyingly, the following doesn't work:

from collections import Counter
import random

c = Counter([1,1,1,1,0,0])
random.choice(c) # I expect this to return 1 with probability 2/3, 
                 # and 0 with probability 1/3.
                 # It actually returns 4 or 2, with probability 1/2

What is the idiomatic way to sample from a multiset in Python (any version)?

Edit yes, I do really need to use a multiset. My actual data is much bigger and just storing it in a list would not be practical.

Edit 2 I need to do this with a reasonable degree of efficiency, as my code will do this repeatedly. There will be a lot of data stored in the Counter object, and anything that involves copying all of this data into a new data structure is not going to be a viable solution.

like image 937
N. Virgo Avatar asked Apr 13 '14 05:04

N. Virgo


People also ask

What is the time complexity of Counter in Python?

As the source code shows, Counter is just a subclass of dict. Constructing it is O(n), because it has to iterate over the input, but operations on individual elements remain O(1).

How do you do random sampling in Python?

To create a list of unique random numbers, we need to use the sample() method. Warp a range() function inside a sample() to create a list of random numbers without duplicates. The range() function generates the sequence f random numbers.

How do you access elements in a Counter Python?

Accessing Elements in Python Counter To get the list of elements in the counter we can use the elements() method. It returns an iterator object for the values in the Counter.


1 Answers

From the docs:

A common task is to make a random.choice() with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats:

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

For your application, you will probably want to use the Counter to build a list of choices and a list of cumulative probabilities, then sample with the second technique.

like image 56
user2357112 supports Monica Avatar answered Sep 21 '22 02:09

user2357112 supports Monica