Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Choosing elements from python list based on probability

Tags:

python

I'm creating a python script that randomly picks 1000 names from the list of male first names located here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html

That works all fine and dandy, but I would like it so that the names were selected based on the probability column provided by the census text files (second column).

I've been trying to wrap my head around this for the past few hours, but I haven't made any real progress, even looking for other answers.

Can anybody help me out or point me in the right direction? Thanks in advance :)

like image 571
IrateIrish Avatar asked Mar 28 '14 19:03

IrateIrish


People also ask

How do you choose elements from a list with different probabilities?

Relative weights to choose elements from the list with different probability. First, define the probability for each element. If you specified the probability using the relative weight, the selections are made according to the relative weights. You can set relative weights using the weight parameter.

How do you select elements from the list with different probability using Numpy?

random. choice() method to choose elements from the list with different probability. Output: Return the numpy array of random samples. Note: parameter p is probabilities associated with each entry in a(1d-array).

How do I get Python to pick a random item in a list?

The simplest way to use Python to select a single random element from a list in Python is to use the random. choice() function. The function takes a single parameter – a sequence. In this case, our sequence will be a list, though we could also use a tuple.


2 Answers

An easy algorithm for weighted selection is:

  1. Assign each name its relative probability, such that the sum of all probabilities is 1. This relative value is called "weight".

  2. Select a random number between 0 and 1

  3. Walk the list, substracting the weight of each item from your number as you go

  4. When you go to 0 or below, pick the current item.

like image 162
slezica Avatar answered Oct 21 '22 19:10

slezica


The third column of the data file is the cumulative probability, the running sum of the second column.

To select a random name with respect to the cumulative probability distribution:

  1. Generate a random number between 0 and 1,
  2. Find the first row whose cumulative probability is bigger than that random number.
  3. Select the name in that row.

import urllib2
import random
import bisect

url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
response = urllib2.urlopen(url)
names, cumprobs = [], []
for line in response:
    name, prob, cumprob, rank = line.split()
    cumprob = float(cumprob)
    names.append(name)
    cumprobs.append(cumprob)

# normalize the cumulative probabilities to the range [0, 1]
cumprobs = [p/cumprobs[-1] for p in cumprobs]
# print(cumprobs)

# Generate 1000 names at random, using the cumulative probability distribution
N = 1000
selected = [names[bisect.bisect(cumprobs, random.random())] for i in xrange(N)]
print('\n'.join(selected))

Note, the alias method has better computational complexity, but for selection of a mere 1000 items, that may not be very important for your use case.

like image 25
unutbu Avatar answered Oct 21 '22 17:10

unutbu