Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select a value from a list with non-uniform probabilities?

I am looking at the k-means++ initialization algorithm. The following two steps of the algorithm give rise to non-uniform probabilities:

For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.

How can I select with this stated weighted probability distribution in C++?

like image 818
zebra Avatar asked Dec 19 '11 22:12

zebra


3 Answers

Discrete distributions is a lot easier to do in C++11 with the random header and using std::discrete_distribution. This is example:

#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::discrete_distribution<> d({20,30,40,10});
    std::map<int, int> m;
    for(int n=0; n<10000; ++n) {
        ++m[d(gen)];
    }
    for(auto p : m) {
        std::cout << p.first << " generated " << p.second << " times\n";
    }
}

and this is a sample of the output:

0 generated 2003 times
1 generated 3014 times
2 generated 4021 times
3 generated 962 times
like image 153
Shafik Yaghmour Avatar answered Nov 14 '22 22:11

Shafik Yaghmour


With a finite set of individual data points X, this calls for a discrete probability distribution.

The easiest way to do this is to enumerate the points X in order, and calculate an array representing their cumulative probability distribution function: (pseudocode follows)

/* 
 * xset is an array of points X,
 * cdf is a preallocated array of the same size
 */
function prepare_cdf(X[] xset, float[] cdf)
{
   float S = 0;
   int N = sizeof(xset);
   for i = 0:N-1
   {
      float weight = /* calculate D(xset[i])^2 here */
      // create cumulative sums and write to the element in cdf array
      S += weight;
      cdf[i] = S;
   }

   // now normalize so the CDF runs from 0 to 1
   for i = 0:N-1
   {
      cdf[i] /= S;
   }
}

function select_point(X[] xset, float[] cdf, Randomizer r)
{
   // generate a random floating point number from a 
   // uniform distribution from 0 to 1
   float p = r.nextFloatUniformPDF();
   int i = binarySearch(cdf, p);
   // find the lowest index i such that p < cdf[i]

   return xset[i];
}

You call prepare_cdf once, and then call select_point as many times as you need to generate random points.

like image 30
Jason S Avatar answered Nov 14 '22 22:11

Jason S


I'd take the following approach:

  • iterate over the data-points, storing their D-squared's in a double distance_squareds[] or std::vector<double> distance_squareds or whatnot, and storing the sum of their D-squared's in a double sum_distance_squareds.
  • use the drand48 function to choose a random number in [0.0, 1.0), and multiply it by sum_distance_squareds; store the result in random_number.
  • iterate over distance_squareds, adding together the values (again), and as soon as the running total meets or exceeds random_number, return the data-point corresponding to the D-squared that you'd just added.
  • due to round-off error, it's remotely possible that you'll finish the loop without having returned; if so, just return the first data-point, or the last one, or whatever. (But don't worry, this should be a very rare edge case.)
like image 31
ruakh Avatar answered Nov 14 '22 22:11

ruakh