Given an array of n
word-frequency pairs:
[ (w0, f0), (w1, f1), ..., (wn-1, fn-1) ]
where wi
is a word, fi
is an integer frequencey, and the sum of the frequencies ∑fi = m
,
I want to use a pseudo-random number generator (pRNG) to select p
words wj0, wj1, ..., wjp-1
such that
the probability of selecting any word is proportional to its frequency:
P(wi = wjk) = P(i = jk) = fi / m
(Note, this is selection with replacement, so the same word could be chosen every time).
I've come up with three algorithms so far:
Create an array of size m
, and populate it so the first f0
entries are w0
, the next f1
entries are w1
, and so on, so the last fp-1
entries are wp-1
.
[ w0, ..., w0, w1,..., w1, ..., wp-1, ..., wp-1 ]Then use the pRNG to select
p
indices in the range 0...m-1
, and report the words stored at those indices.O(n + m + p)
work, which isn't great, since m
can be much much larger than n.Step through the input array once, computing
mi = ∑h≤ifh = mi-1 + fiand after computing
mi
, use the pRNG to generate a number xk
in the range 0...mi-1
for each k
in 0...p-1
and select wi
for wjk
(possibly replacing the current value of wjk
) if xk < fi
.O(n + np)
work.mi
as in algorithm 2, and generate the following array on n word-frequency-partial-sum triples:[ (w0, f0, m0), (w1, f1, m1), ..., (wn-1, fn-1, mn-1) ]and then, for each k in
0...p-1
, use the pRNG to generate a number xk
in the range 0...m-1
then do binary search on the array of triples to find the i
s.t. mi-fi ≤ xk < mi
, and select wi
for wjk
.O(n + p log n)
work.My question is: Is there a more efficient algorithm I can use for this, or are these as good as it gets?
This sounds like roulette wheel selection, mainly used for the selection process in genetic/evolutionary algorithms.
Look at Roulette Selection in Genetic Algorithms
You could create the target array, then loop through the words determining the probability that it should be picked, and replace the words in the array according to a random number.
For the first word the probability would be f0/m0 (where mn=f0+..+fn), i.e. 100%, so all positions in the target array would be filled with w0.
For the following words the probability falls, and when you reach the last word the target array is filled with randomly picked words accoding to the frequency.
Example code in C#:
public class WordFrequency {
public string Word { get; private set; }
public int Frequency { get; private set; }
public WordFrequency(string word, int frequency) {
Word = word;
Frequency = frequency;
}
}
WordFrequency[] words = new WordFrequency[] {
new WordFrequency("Hero", 80),
new WordFrequency("Monkey", 4),
new WordFrequency("Shoe", 13),
new WordFrequency("Highway", 3),
};
int p = 7;
string[] result = new string[p];
int sum = 0;
Random rnd = new Random();
foreach (WordFrequency wf in words) {
sum += wf.Frequency;
for (int i = 0; i < p; i++) {
if (rnd.Next(sum) < wf.Frequency) {
result[i] = wf.Word;
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With