I'm creating a python script that randomly picks 1000 names from the list of male first names located here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html
That works all fine and dandy, but I would like it so that the names were selected based on the probability column provided by the census text files (second column).
I've been trying to wrap my head around this for the past few hours, but I haven't made any real progress, even looking for other answers.
Can anybody help me out or point me in the right direction? Thanks in advance :)
Relative weights to choose elements from the list with different probability. First, define the probability for each element. If you specified the probability using the relative weight, the selections are made according to the relative weights. You can set relative weights using the weight parameter.
random. choice() method to choose elements from the list with different probability. Output: Return the numpy array of random samples. Note: parameter p is probabilities associated with each entry in a(1d-array).
The simplest way to use Python to select a single random element from a list in Python is to use the random. choice() function. The function takes a single parameter – a sequence. In this case, our sequence will be a list, though we could also use a tuple.
An easy algorithm for weighted selection is:
Assign each name its relative probability, such that the sum of all probabilities is 1. This relative value is called "weight".
Select a random number between 0 and 1
Walk the list, substracting the weight of each item from your number as you go
When you go to 0 or below, pick the current item.
The third column of the data file is the cumulative probability, the running sum of the second column.
To select a random name with respect to the cumulative probability distribution:
import urllib2
import random
import bisect
url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
response = urllib2.urlopen(url)
names, cumprobs = [], []
for line in response:
name, prob, cumprob, rank = line.split()
cumprob = float(cumprob)
names.append(name)
cumprobs.append(cumprob)
# normalize the cumulative probabilities to the range [0, 1]
cumprobs = [p/cumprobs[-1] for p in cumprobs]
# print(cumprobs)
# Generate 1000 names at random, using the cumulative probability distribution
N = 1000
selected = [names[bisect.bisect(cumprobs, random.random())] for i in xrange(N)]
print('\n'.join(selected))
Note, the alias method has better computational complexity, but for selection of a mere 1000 items, that may not be very important for your use case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With