I have an array which contains names and percentages.
Example:
[["JAMES", 3.318], ["JOHN", 3.271], ["ROBERT", 3.143]]
.
Now I have about a thousand of these names, and I'm trying to figure out how to choose a name randomly based on the percentage of the name (like how James as 3.318% and John as 3.271%), so that name will have that percentage of being picked (Robert will have a 3.143% of being picked). Help would be appreciated.
You can use max_by
: (the docs contain a similar example)
array.max_by { |_, weight| rand ** 1.fdiv(weight) }
This assumes that your weights are actual percentages, i.e. 3.1% has to be expressed as 0.031
. Or, if you don't want to adjust your weights:
array.max_by { |_, weight| rand ** 100.fdiv(weight) }
I'm using fdiv
here to account for possible integer values. If your weights are always floats, you can also use /
.
Even though I like @Stefan answer more than mine, I will contribute with a possible solution: I would distribute all my percentages along 100.0
so that they start from 0.0 and end to 100.0.
Imagine I have an array with the following percentages:
a = [10.5, 20.5, 17.8, 51.2]
where
a.sum = 100.0
We could write the following to distribute them along 100.0:
sum = 0.0
b = a.map { |el| sum += el }
and the result would be
b = [10.5, 31.0, 48.8, 100.0]
now I can generate a random number between 0.0 and 100.0:
r = rand(0.0..100.0) # or r = rand * 100.0
imagine r is 45.32
.
I select the first element of b
that is >= r`
idx = b.index { |el| el >= r }
which in our case would return 2
.
Now you can select a[idx]
.
But I would go with @Stefan answer as well :)
I assume you will be drawing multiple random values, in which case efficiency is important. Moreover, I assume that all names are unique and all percentages are positive (i.e., that pairs with percentages of 0.0
have been removed).
You are given what amounts to a (discrete) probability density function (PDF). The first step is to convert that to a cumulative density function (CDF).
Suppose we are given the following array (whose percentages sum to 100).
arr = [["LOIS", 28.16], ["JAMES", 22.11], ["JOHN", 32.71], ["ROBERT", 17.02]]
First, separate the names from the percentages.
names, probs = arr.transpose
#=> [["LOIS", "JAMES", "JOHN", "ROBERT"],
# [28.16, 22.11, 32.71, 17.02]]
Next compute the CDF.
cdf = probs.drop(1).
each_with_object([0.01 * probs.first]) { |pdf, cdf|
cdf << 0.01 * pdf + cdf.last }
#=> [0.2816, 0.5027, 0.8298, 1.0]
The idea is that we will generate a (pseudo) random number between zero and one, r
and find the first value c
of the CDF for which r <= c
.1 To do this in an efficient way we will perform an intelligent search of the CDF. This is possible because the CDF is an increasing function.
I will do a binary search, using Array#bsearch_index. This method is essentially the same as Array#bseach (whose doc is the relevant one), except the index of cdf
is returned rather than the element of cdf
is randomly selected. It will shortly be evident why we want the index.
r = rand
#=> 0.6257547400776025
idx = cdf.bsearch_index { |c| r <= c }
#=> 2
Note that we cannot write cdf.bsearch_index { |c| rand <= c }
as rand
would be executed each time the block is evaluated.
The randomly-selected name is therefore2
names[idx]
#=> "JOHN"
Now let's put all this together.
def setup(arr)
@names, probs = arr.transpose
@cdf = probs.drop(1).
each_with_object([0.01*probs.first]) { |pdf, cdf| cdf << 0.01 * pdf + cdf.last }
end
def random_name
r = rand
@names[@cdf.bsearch_index { |c| r <= c }]
end
Let's try it. Execute setup
to compute the instance variables @names
and @cdf
.
setup(arr)
@names
#=> ["LOIS", "JAMES", "JOHN", "ROBERT"]
@cdf
#=> [0.2816, 0.5027, 0.8298, 1.0]
and then call random_name
each time a random name is wanted.
5.times.map { random_name }
#=> ["JOHN", "LOIS", "JAMES", "LOIS", "JAMES"]
1. This is how most discrete random variates are generated in simulation models.
2. Had I used bsearch
rather than bsearch_index
I would have had to earlier create a hash with cdf=>name
key-value pairs in order to retrieve a name for a given randomly-selected CDF value.
This is my solution to the problem:
array = [["name1", 33],["name2", 20],["name3",10],["name4",7],["name5", 30]]
def random_name(array)
random_number = rand(0.000..100.000)
sum = 0
array.each do |x|
if random_number.between?(sum, sum + x[1])
return x[0]
else
sum += x[1]
end
end
end
puts random_name(array)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With