Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sample a truncated integer power law in Python?

What function can I use in Python if I want to sample a truncated integer power law?

That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.

I've been searching around numpy.random, but I haven't found this distribution.

like image 524
becko Avatar asked Jul 04 '14 18:07

becko


2 Answers

AFAIK, neither NumPy nor Scipy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution function using scipy.rv_discrete:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 10
d = truncated_power_law(a=a, m=m)

N = 10**4
sample = d.rvs(size=N)

plt.hist(sample, bins=np.arange(m)+0.5)
plt.show()

enter image description here

like image 65
unutbu Avatar answered Nov 07 '22 15:11

unutbu


I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.

Setup:

  1. Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.

  2. Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.

Usage:

  • Generate your random values by generating a Uniform(0,1) and searching through cdf[] until you find an entry greater than your uniform. Return the index + 1 as your x-value.

Repeat for as many x-values as you want.

For instance, with a,m = 2,10, I calculate the probabilities directly as:

[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]

and the CDF is:

[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]

When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.

If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.

like image 29
pjs Avatar answered Nov 07 '22 15:11

pjs