Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm to randomly select items with frequency

Given an array of n word-frequency pairs:

[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]

where wi is a word, fi is an integer frequencey, and the sum of the frequencies ∑fi = m,

I want to use a pseudo-random number generator (pRNG) to select p words wj0, wj1, ..., wjp-1 such that the probability of selecting any word is proportional to its frequency:

P(wi = wjk) = P(i = jk) = fi / m

(Note, this is selection with replacement, so the same word could be chosen every time).

I've come up with three algorithms so far:

  1. Create an array of size m, and populate it so the first f0 entries are w0, the next f1 entries are w1, and so on, so the last fp-1 entries are wp-1.

    [ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]
    Then use the pRNG to select p indices in the range 0...m-1, and report the words stored at those indices.
    This takes O(n + m + p) work, which isn't great, since m can be much much larger than n.
  2. Step through the input array once, computing

    mi = ∑h≤ifh = mi-1 + fi
    and after computing mi, use the pRNG to generate a number xk in the range 0...mi-1 for each k in 0...p-1 and select wi for wjk (possibly replacing the current value of wjk) if xk < fi.
    This requires O(n + np) work.
  3. Compute mi as in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:
    [ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]
    and then, for each k in 0...p-1, use the pRNG to generate a number xk in the range 0...m-1 then do binary search on the array of triples to find the i s.t. mi-fi ≤ xk < mi, and select wi for wjk.
    This requires O(n + p log n) work.

My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?

like image 748
rampion Avatar asked May 16 '09 14:05

rampion


2 Answers

This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.

Look at Roulette Selection in Genetic Algorithms

like image 62
seb Avatar answered Nov 16 '22 16:11

seb


You could create the target array, then loop through the words determining the probability that it should be picked, and replace the words in the array according to a random number.

For the first word the probability would be f0/m0 (where mn=f0+..+fn), i.e. 100%, so all positions in the target array would be filled with w0.

For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words accoding to the frequency.

Example code in C#:

public class WordFrequency {

    public string Word { get; private set; }
    public int Frequency { get; private set; }

    public WordFrequency(string word, int frequency) {
        Word = word;
        Frequency = frequency;
    }

}

WordFrequency[] words = new WordFrequency[] {
    new WordFrequency("Hero", 80),
    new WordFrequency("Monkey", 4),
    new WordFrequency("Shoe", 13),
    new WordFrequency("Highway", 3),
};

int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
    sum += wf.Frequency;
    for (int i = 0; i < p; i++) {
        if (rnd.Next(sum) < wf.Frequency) {
            result[i] = wf.Word;
        }
    }
}
like image 2
Guffa Avatar answered Nov 16 '22 16:11

Guffa