Weighted random sample without replacement in python

Tags: python, random, numpy

I need to obtain a k-sized sample without replacement from a population, where each member of the population has an associated weight (W).

Python's random.choices will not perform this task without replacement, and random.sample won't take a weighted input.

Currently, this is what I am using:

P = np.zeros((1,Parent_number))
n=0
while n < Parent_number:
    draw = random.choices(population,weights=W,k=1)
    if draw not in P:
        P[0,n] = draw[0]
        n=n+1
P=np.asarray(sorted(P[0]))

While this works, it requires switching back and forth from arrays to lists and back to arrays, and is therefore less than ideal.

I am looking for the simplest and easiest to understand solution as this code will be shared with others.

asked Apr 21 '17 18:04

Austin Downey

2 Answers

You can use np.random.choice with replace=False as follows:

np.random.choice(vec,size,replace=False, p=P)

where vec is your population and P is the weight vector, given as probabilities that sum to 1.

For example:

import numpy as np
vec=[1,2,3]
P=[0.5,0.2,0.3]
np.random.choice(vec,size=2,replace=False, p=P)
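If the weights are not already normalized, they have to be rescaled first, because the p argument expects probabilities that sum to 1. A minimal sketch of that step, assuming raw non-negative weights in a hypothetical array W (the population values and the seed are made up for illustration) and using the newer Generator interface from np.random.default_rng, which accepts the same replace and p arguments:

```python
import numpy as np

# Hypothetical population and raw (unnormalized) weights, for illustration only.
population = np.array([10, 20, 30, 40, 50])
W = np.array([5.0, 1.0, 3.0, 0.5, 0.5])
k = 3

# p must sum to 1, so rescale the raw weights into probabilities.
p = W / W.sum()

rng = np.random.default_rng(seed=0)  # seeded only to make the run repeatable
sample = rng.choice(population, size=k, replace=False, p=p)

# Sorting afterwards mirrors the sorted output in the question.
sample = np.sort(sample)
```

Note that sampling without replacement needs at least k entries with a nonzero probability, otherwise the call raises an error.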

answered Oct 02 '22 15:10

Miriam Farber

Built-in solution

As suggested by Miriam Farber, you can use NumPy's built-in solution:

np.random.choice(vec,size,replace=False, p=P)

Pure python equivalent

What follows is close to what numpy does internally, except that it uses plain lists and random.choices() from the standard library rather than numpy arrays:

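from random import choices

def weighted_sample_without_replacement(population, weights, k=1):
    # Assumes at least k members have a nonzero weight.
    weights = list(weights)              # copy, so the caller's weights are untouched
    positions = range(len(population))
    indices = []
    while True:
        needed = k - len(indices)
        if not needed:
            break
        # Draw with replacement, then keep only first-time picks.
        for i in choices(positions, weights, k=needed):
            if weights[i]:
                weights[i] = 0.0         # a zero weight prevents a repeat pick
                indices.append(i)
    return [population[i] for i in indices]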
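For readers who find the retry loop hard to follow, the same idea can be spelled out one draw at a time: pick a single item by weight, remove it and its weight, and repeat. This is only an illustrative sketch, not numpy's actual implementation; the name draw_one_at_a_time and the sample data are hypothetical.

```python
from random import choices

def draw_one_at_a_time(population, weights, k):
    """Pick k distinct items, removing each chosen item before the next draw."""
    items = list(population)   # copies, so the inputs are left unchanged
    w = list(weights)
    picked = []
    for _ in range(k):
        # choices() returns a list; take the single index it drew.
        i = choices(range(len(items)), weights=w, k=1)[0]
        picked.append(items.pop(i))
        w.pop(i)
    return picked

# Hypothetical data for illustration.
result = draw_one_at_a_time(['a', 'b', 'c', 'd'], [0.4, 0.3, 0.2, 0.1], k=3)
```

It is slower than the batched version for large k because it makes one call to choices() per item, but each step is easy to explain. Like the version above, it assumes at least k members have a nonzero weight.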

Related problem: Selection when elements can be repeated

This is sometimes called an urn problem. For example, given an urn with 10 red balls, 4 blue balls, and 18 green balls, choose nine balls without replacement.

To do it with numpy, generate the unique selections from the total population count with sample(). Then, bisect the cumulative weights to get the population indices.

import numpy as np
from random import sample

population = np.array(['red', 'blue', 'green'])
counts = np.array([10, 4, 18])
k = 9

cum_counts = np.add.accumulate(counts)
total = cum_counts[-1]
selections = sample(range(total), k=k)
indices = np.searchsorted(cum_counts, selections, side='right')
result = population[indices]

To do this without numpy, the same approach can be implemented with bisect() and accumulate() from the standard library:

from random import sample
from bisect import bisect
from itertools import accumulate

population = ['red', 'blue', 'green']
weights = [10, 4, 18]
k = 9

cum_weights = list(accumulate(weights))
total = cum_weights.pop()
selections = sample(range(total), k=k)
indices = [bisect(cum_weights, s) for s in selections]
result = [population[i] for i in indices]
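On Python 3.9 and later, random.sample() also accepts a counts keyword argument that handles repeated elements directly, so the cumulative sums and bisection can be skipped. A short sketch, assuming that Python version is available; the Counter tally is only there to inspect the result:

```python
from random import sample
from collections import Counter

population = ['red', 'blue', 'green']
counts = [10, 4, 18]
k = 9

# Behaves like sampling from a list holding 10 'red', 4 'blue' and 18 'green'.
result = sample(population, counts=counts, k=k)

# Tally how many balls of each color were drawn.
tally = Counter(result)
```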

answered Oct 02 '22 17:10

Raymond Hettinger
