Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well-known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
Sampling without replacement: a subset of the observations is selected randomly, and once an observation is selected it cannot be selected again. Sampling with replacement: a subset of the observations is selected randomly, and an observation may be selected more than once.
The precision of estimates is usually higher for sampling without replacement than for sampling with replacement. For example, in an extreme case, sampling with replacement can select the same single element all n times.
In weighted random sampling (WRS) the items are weighted, and the probability of each item being selected is determined by its relative weight.
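For concreteness, here is a minimal sketch of weighted selection with replacement via a cumulative-weight binary search (the names and weights are illustrative, not from any particular library):

    import bisect
    import itertools
    import random

    def weighted_choice(items, cum_weights):
        # draw a point in [0, total_weight) and locate it in the running totals
        r = random.random() * cum_weights[-1]
        return items[bisect.bisect_right(cum_weights, r)]

    items = ['a', 'b', 'c']
    cum_weights = list(itertools.accumulate([1.0, 2.0, 3.0]))  # [1.0, 3.0, 6.0]
    samples = [weighted_choice(items, cum_weights) for _ in range(10)]

Each draw costs O(log n) after the O(n) prefix-sum setup; the alias method described next brings this down to O(1) per draw.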
One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, avoiding a binary search. It turns out that, done correctly, we only need to store two items from the original list per bin, and thus can represent the split with a single percentage.
Let's take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup (a construction sketch follows this list):

1. Normalize the weights such that they sum to 1.0: (a:0.2, b:0.2, c:0.2, d:0.2, e:0.2). This is the probability of choosing each variable.
2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition: (p1{a|null,1.0}, p2, p3, p4, p5, p6, p7, p8), with (a:0.075, b:0.2, c:0.2, d:0.2, e:0.2) remaining.
4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
5. Repeat steps 3 and 4 until all of the original weight has been assigned to partitions. For example, after another iteration of 3 and 4, we have (p1{a|null,1.0}, p2{a|b,0.6}, p3, p4, p5, p6, p7, p8) with (a:0, b:0.15, c:0.2, d:0.2, e:0.2) left to be assigned.
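A minimal sketch of this construction, assuming the weights arrive as a dict of item to positive weight (the function name and epsilon tolerance are my own choices, not from the description above):

    def build_alias_table(weights):
        total = sum(weights.values())
        remaining = {k: w / total for k, w in weights.items()}  # step 1: normalize
        num_parts = 1 << (len(weights) - 1).bit_length()        # step 2: next power of 2
        part_mass = 1.0 / num_parts
        table = []  # one (first_item, second_item, split_fraction) entry per partition
        for _ in range(num_parts):
            # step 3: place as much of the lightest remaining item as fits
            small = min(remaining, key=remaining.get)
            filled = min(remaining[small], part_mass)
            remaining[small] -= filled
            if remaining[small] <= 1e-12:
                del remaining[small]
            if filled >= part_mass - 1e-12 or not remaining:
                table.append((small, None, 1.0))  # partition holds a single item
                continue
            # step 4: top up the partition with the heaviest remaining item
            big = max(remaining, key=remaining.get)
            remaining[big] -= part_mass - filled
            table.append((small, big, filled / part_mass))
        return table

Running this on the example (a:1, b:1, c:1, d:1, e:1) reproduces the partitions above, starting with (a, None, 1.0) and (a, b, 0.6).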
At runtime (a sampling sketch follows this list):

1. Get a U(0,1) random number, say binary 0.001100000.
2. Bit-shift it left by lg2(|p|) to find the index of the partition. Here we shift by 3, yielding 001.1, or position 1 (zero-indexed), and thus partition 2.
3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
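Continuing the sketch, the lookup step: multiplying the random number by the table length (a power of two) and splitting off the integer part is exactly the bit shift described above:

    import random

    def alias_sample(table):
        r = random.random() * len(table)   # shift the binary fraction left by lg2(|p|)
        index = int(r)                     # integer part picks the partition
        first, second, split = table[index]
        return first if r - index < split else second  # fractional part decides the split

    table = build_alias_table({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1})
    draws = [alias_sample(table) for _ in range(5)]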
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
A simple approach that hasn't been mentioned here is the one proposed by Efraimidis and Spirakis. In Python, you could select m items out of n >= m weighted items, with strictly positive weights stored in weights, returning the selected indices, with:
    import heapq
    import math
    import random

    def WeightedSelectionWithoutReplacement(weights, m):
        # key for index i is log(u_i) / w_i with u_i ~ U(0,1); the m largest
        # keys form a weighted sample without replacement
        elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
        return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved in the linked paper that their approach is equivalent to random sampling without replacement.
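As a quick sanity check of the function above (the weights and trial count here are purely illustrative), heavier items should turn up more often among the selected indices:

    weights = [1.0, 1.0, 10.0]
    counts = [0, 0, 0]
    for _ in range(10000):
        for i in WeightedSelectionWithoutReplacement(weights, 2):
            counts[i] += 1
    print(counts)  # index 2 should appear in nearly every 2-item sample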