Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for probability when looping over a randomly ordered array

The problem is pretty straightforward, I think, by looking at the code. I have a randomized array (the array must be randomized, some code has been excluded because it doesn't pertain to the actual problem, but does require randomization). For each element in the array, there is a "probability" index (described here as the value itself, in $rules) that is suppose to hint that, if other conditions are met (that are removed here for the sake of non-relevancy), the probability that array element will be "triggered" (in this case, that the array element's score will increment by 1)

Consider the code:

<?php
  // Taken from php.net/shuffle user notes
  // Shuffles an array order for the sake of foreach while maintaining
  // key => value associations
  function shuffle_assoc(&$array) {
    $keys = array_keys($array);
    shuffle($keys);
    foreach($keys as $key) {
      $new[$key] = $array[$key];
    }
    return $new;
  }

  $i = 1000000; // How many tests to perform

  // This is my rule list.  Each key is a simple color
  // and each value is a probability represented as a percent
  $rules = array(
    'black' => 20,
    'white' => 10,
    'red' => 40,
    'green' => 5,
    'blue' => 25,
  );

  // Initialize the scores array with all 0's
  // The "outs" will be used when the probability does not
  // occur in any of the rules
  $scores = array('outs' => 0);
  foreach($rules as $k => $v) { 
    $scores[$k] = 0;
  }

  $count = count($rules);

  for($x = 0; $x < $i; $x++) { 
    $rules = shuffle_assoc($rules);

    foreach($rules as $k => $probability) {
      $rand = mt_rand(1,100);
      //$probability = ??; I've tried applying many different operations here to "correct" the probability

      if($rand > $probability) { 
        continue; 
      } else {
        $scores[$k]++;
        continue 2;
      }
    }
    $scores['outs']++;
  }


  foreach($scores as $k => $v) { 
    echo "$k: " . (($v/$i)*100) . "% ($v/$i)\n";
  }

?>

Expected output (pseudo). Note the percentages correspond with the values of $rules

outs: less than 1% (.../1000000)
black: 20% (.../1000000)
white: 10% (.../1000000)
red: 40% (.../1000000)
green: 5% (.../1000000)
blue: 25% (.../1000000)

Example output:

outs: 30.7128% (307128/1000000)
black: 13.2114% (132114/1000000)
white: 6.3381% (63381/1000000)
red: 29.5247% (295247/1000000)
green: 3.1585% (31585/1000000)
blue: 17.0545% (170545/1000000)

Things I've tried & Considerations:

  • As you can see, within the loop I have a commented out section of $probability = ?? which I've tried various obvious-to-me methods of calculating the actual probability to use within each element, including playing with $count (count of rules) which is why that variable exists and isn't used.

  • It doesn't have to be exact obviously, but preferably has stable results over a smaller set of numbers (e.x. 1,000 iterations).

  • It can be pretty fuzzy. A variance of +/- 5% wouldn't hurt my feelings, especially in smaller numbers of iterations, I understand big number theory comes to play here.

  • The number of outs isn't a big deal as long as they're less than 1%-2%. I also tried eliminating outs using various methods to see if the outs alone were skewing, and interestingly enough when I did that on one occasion, I got a 20% split all around (i.e. even).

  • Furthermore, on "outs", I was able to get pretty close to the proper split with very little outs by basically brute-forcing the probability "numbers" (that is, the values of $rules) starting from 100 backwards, but I was never able to find out a precise, optimal method. Each time, I would get closer to the result for one color, that would skew the other colors on a small but noticeable scale. There was no easy-for-me-to-grasp correlation in these numbers and were seemingly random although it is obvious that the results played well with probability vs big numbers.

Tell me there is a precise way to calculate this. It's driving me nuts.

Edit: I have a finalized version of my code, with the help from the two answers below, that does this without the need for knowing probability percentages before the loop begins, and no additional or nested loops (which is what I specifically needed, I guess I should of been more direct in that part) .. In the sense of, each iteration, you could be pulling the probability dynamically based on that specific iteration's properties.. All answers here were invaluable, here is my version of the final code: http://pastebin.com/eB3TVP1E

like image 883
A.B. Carroll Avatar asked Jan 17 '13 22:01

A.B. Carroll


2 Answers

Just normalize the results, accumulate them and then you are done.

What I mean is:

  • sum all probabilities given for every item of the array to get the total (which is 100 in your case but it's easily generalizable)
  • divide every probability for the total

So for example:

$rules = array(
    'black' => 20,
    'white' => 10,
    'red' => 40,
    'green' => 5,
    'blue' => 25,
  );

will be normalized to:

$rules_norm = array(
    'black' => 0.2,
    'white' => 0.1,
    'red' => 0.4,
    'green' => 0.05,
    'blue' => 0.25,
  );
  • now accumulate the result so that for every element in $rules_norm you calculate the sum of all previous elements plus the current one.

So:

$rules_norm = array(
    'black' => 0.2,
    'white' => 0.3,
    'red' => 0.7,
    'green' => 0.75,
    'blue' => 1.0,
  );

Now with this you can just extract a random float number in range [0,1) and choose which elements are increased according to the result: to increment the score of one element just start from the first one in the array and increment the one such that $rand > $rules_norm[k]

like image 163
Jack Avatar answered Sep 17 '22 16:09

Jack


Jack's idea implemented in your code (if the sum of probabilities is >100 this won't work):

php fiddle

<?php
  // Taken from php.net/shuffle user notes
  // Shuffles an array order for the sake of foreach while maintaining
  // key => value associations
  function shuffle_assoc(&$array) {
    $keys = array_keys($array);
    shuffle($keys);
    foreach($keys as $key) {
      $new[$key] = $array[$key];
    }
    return $new;
  }

  $i = 1000000; // How many tests to perform

  // This is my rule list.  Each key is a simple color
  // and each value is a probability represented as a percent
  $rules = array(
    'black' => 20,
    'white' => 10,
    'red' => 40,
    'green' => 5,
    'blue' => 25,
  );

  // Initialize the scores array with all 0's
  // The "outs" will be used when the probability does not
  // occur in any of the rules
  $scores = array('outs' => 0);
  foreach($rules as $k => $v) { 
    $scores[$k] = 0;
  }

  $count = count($rules);
//$limits is what Jack called $rules_norm
$limits=array();
$limit=0;
foreach($rules as $k=>$v)
{
    $limit+=$v;
    $limits[$k]=$limit;
}
  for($x = 0; $x < $i; $x++) { 
      $rand = mt_rand(1,100);
foreach($limits as $k=>$v)
{
    if($v>=$rand)
    {
        $scores[$k]++;
        continue(2);
    }

}
    $scores['outs']++;
  }


  foreach($scores as $k => $v) { 
    echo "$k: " . (($v/$i)*100) . "% ($v/$i)\n";
  }

?>
like image 40
Jon Hulka Avatar answered Sep 18 '22 16:09

Jon Hulka