Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here.

I was wondering if there is a way to convert this approach to a weighted one.

I am also interested in other approaches as well.

Update: Sampling without replacement

like image 774
nimcap Avatar asked Jan 26 '10 16:01

nimcap


People also ask

How do you do weighted random selection?

For random selection with particular weights, the following technique can then be used: Generate a random number between 0 and s u m ( w ) − 1 sum(w)-1 sum(w)−1. Find the smallest index that corresponds to the prefix sum greater than the randomly chosen number.

What is weighted random sampling?

In weighted random sampling (WRS) the items are weighted and the probability of each item to be selected is determined by its relative weight.


1 Answers

If the sampling is with replacement, you can use this algorithm (implemented here in Python):

import random  items = [(10, "low"),          (100, "mid"),          (890, "large")]  def weighted_sample(items, n):     total = float(sum(w for w, v in items))     i = 0     w, v = items[0]     while n:         x = total * (1 - random.random() ** (1.0 / n))         total -= x         while x > w:             x -= w             i += 1             w, v = items[i]         w -= x         yield v         n -= 1 

This is O(n + m) where m is the number of items.

Why does this work? It is based on the following algorithm:

def n_random_numbers_decreasing(v, n):     """Like reversed(sorted(v * random() for i in range(n))),     but faster because we avoid sorting."""     while n:         v *= random.random() ** (1.0 / n)         yield v         n -= 1 

The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.

This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)n. Solve for z, and you get z = vP1/n. Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.

If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)

Here's an implementation of that, plentifully commented:

import random  class Node:     # Each node in the heap has a weight, value, and total weight.     # The total weight, self.tw, is self.w plus the weight of any children.     __slots__ = ['w', 'v', 'tw']     def __init__(self, w, v, tw):         self.w, self.v, self.tw = w, v, tw  def rws_heap(items):     # h is the heap. It's like a binary tree that lives in an array.     # It has a Node for each pair in `items`. h[1] is the root. Each     # other Node h[i] has a parent at h[i>>1]. Each node has up to 2     # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple     # arithmetic, we have to leave h[0] vacant.     h = [None]                          # leave h[0] vacant     for w, v in items:         h.append(Node(w, v, w))     for i in range(len(h) - 1, 1, -1):  # total up the tws         h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent     return h  def rws_heap_pop(h):     gas = h[1].tw * random.random()     # start with a random amount of gas      i = 1                     # start driving at the root     while gas >= h[i].w:      # while we have enough gas to get past node i:         gas -= h[i].w         #   drive past node i         i <<= 1               #   move to first child         if gas >= h[i].tw:    #   if we have enough gas:             gas -= h[i].tw    #     drive past first child and descendants             i += 1            #     move to second child     w = h[i].w                # out of gas! h[i] is the selected node.     v = h[i].v      h[i].w = 0                # make sure this node isn't chosen again     while i:                  # fix up total weights         h[i].tw -= w         i >>= 1     return v  def random_weighted_sample_no_replacement(items, n):     heap = rws_heap(items)              # just make a heap...     for i in range(n):         yield rws_heap_pop(heap)        # and pop n items off it. 
like image 97
12 revs, 3 users 96% Avatar answered Sep 28 '22 04:09

12 revs, 3 users 96%