Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pick item in array by percentage

Tags:

ruby

I have an array which contains names and percentages. Example: [["JAMES", 3.318], ["JOHN", 3.271], ["ROBERT", 3.143]]. Now I have about a thousand of these names, and I'm trying to figure out how to choose a name randomly based on the percentage of the name (like how James as 3.318% and John as 3.271%), so that name will have that percentage of being picked (Robert will have a 3.143% of being picked). Help would be appreciated.

like image 600
Darabramandamus Avatar asked May 19 '18 15:05

Darabramandamus


4 Answers

You can use max_by: (the docs contain a similar example)

array.max_by { |_, weight| rand ** 1.fdiv(weight) }

This assumes that your weights are actual percentages, i.e. 3.1% has to be expressed as 0.031. Or, if you don't want to adjust your weights:

array.max_by { |_, weight| rand ** 100.fdiv(weight) }

I'm using fdiv here to account for possible integer values. If your weights are always floats, you can also use /.

like image 128
Stefan Avatar answered Oct 23 '22 05:10

Stefan


Even though I like @Stefan answer more than mine, I will contribute with a possible solution: I would distribute all my percentages along 100.0 so that they start from 0.0 and end to 100.0. Imagine I have an array with the following percentages:

a = [10.5, 20.5, 17.8, 51.2]

where

a.sum = 100.0

We could write the following to distribute them along 100.0:

sum = 0.0
b = a.map { |el| sum += el }

and the result would be

b = [10.5, 31.0, 48.8, 100.0]

now I can generate a random number between 0.0 and 100.0:

r = rand(0.0..100.0) # or r = rand * 100.0

imagine r is 45.32.

I select the first element of b that is >= r`

idx = b.index { |el| el >= r }

which in our case would return 2.

Now you can select a[idx].

But I would go with @Stefan answer as well :)

like image 4
coorasse Avatar answered Oct 23 '22 07:10

coorasse


I assume you will be drawing multiple random values, in which case efficiency is important. Moreover, I assume that all names are unique and all percentages are positive (i.e., that pairs with percentages of 0.0 have been removed).

You are given what amounts to a (discrete) probability density function (PDF). The first step is to convert that to a cumulative density function (CDF).

Suppose we are given the following array (whose percentages sum to 100).

arr = [["LOIS", 28.16], ["JAMES", 22.11], ["JOHN", 32.71], ["ROBERT", 17.02]]

First, separate the names from the percentages.

names, probs = arr.transpose
  #=> [["LOIS", "JAMES", "JOHN", "ROBERT"],
  #     [28.16, 22.11, 32.71, 17.02]]

Next compute the CDF.

cdf = probs.drop(1).
            each_with_object([0.01 * probs.first]) { |pdf, cdf|
              cdf << 0.01 * pdf + cdf.last }
  #=> [0.2816, 0.5027, 0.8298, 1.0]

The idea is that we will generate a (pseudo) random number between zero and one, r and find the first value c of the CDF for which r <= c.1 To do this in an efficient way we will perform an intelligent search of the CDF. This is possible because the CDF is an increasing function.

I will do a binary search, using Array#bsearch_index. This method is essentially the same as Array#bseach (whose doc is the relevant one), except the index of cdf is returned rather than the element of cdf is randomly selected. It will shortly be evident why we want the index.

r = rand
  #=> 0.6257547400776025
idx = cdf.bsearch_index { |c| r <= c }
  #=> 2

Note that we cannot write cdf.bsearch_index { |c| rand <= c } as rand would be executed each time the block is evaluated.

The randomly-selected name is therefore2

names[idx]
  #=> "JOHN"

Now let's put all this together.

def setup(arr)
  @names, probs = arr.transpose
  @cdf = probs.drop(1).
    each_with_object([0.01*probs.first]) { |pdf, cdf| cdf << 0.01 * pdf + cdf.last }
end

def random_name
  r = rand
  @names[@cdf.bsearch_index { |c| r <= c }]
end

Let's try it. Execute setup to compute the instance variables @names and @cdf.

setup(arr)
@names
  #=> ["LOIS", "JAMES", "JOHN", "ROBERT"]
@cdf
  #=> [0.2816, 0.5027, 0.8298, 1.0]

and then call random_name each time a random name is wanted.

5.times.map { random_name }
  #=> ["JOHN", "LOIS", "JAMES", "LOIS", "JAMES"]

1. This is how most discrete random variates are generated in simulation models.

2. Had I used bsearch rather than bsearch_index I would have had to earlier create a hash with cdf=>name key-value pairs in order to retrieve a name for a given randomly-selected CDF value.

like image 2
Cary Swoveland Avatar answered Oct 23 '22 06:10

Cary Swoveland


This is my solution to the problem:

array = [["name1", 33],["name2", 20],["name3",10],["name4",7],["name5", 30]]

def random_name(array)
  random_number = rand(0.000..100.000) 
  sum = 0

array.each do |x|
  if random_number.between?(sum, sum + x[1])
    return x[0]
  else
    sum += x[1]
  end
end
end

puts random_name(array)
like image 1
Viktor Avatar answered Oct 23 '22 07:10

Viktor