Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate a random series of values (B) that have a given correlation with a given series (A)

For an educational website, it is my purpose to let students fool around somewhat with value series and their collelation. For instance, students can enter two arrays for which the correlation is calculated:

$array_x = array(5,3,6,7,4,2,9,5);
$array_y = array(4,3,4,8,3,2,10,5);

echo Correlation($array_x, $array_y); // 0.93439982209434

The code for this works perfectly and can be found at the bottom of this post. I'm however now facing a challenge. What I want is the following:

  • student inputs a $array_x (5,3,6,7,4,2,9,5)
  • student inputs a correlation (0.9)
  • student inputs the boundaries of $array_y (for instance, between 1 and 10 or between 50 and 80)
  • the script returns a random array (for instance: 4,3,4,8,3,2,10,5) which has (about) the given correlation

So, in other words, the code would have to work like:

$array_x = array(5,3,6,7,4,2,9,5);
$boundaries = array(1, 10);
$correlation = 0.9;

echo ySeries($array_x, $boundaries, $correlation); // array(4,3,4,8,3,2,10,5)

At the Stackexchange Math forum, @ilya answered (inserted as an image, since Latex formatting of fomulas don't seem to work on stackoverflow):

enter image description here

P.S. The code used to calculate the correlation:

function Correlation($arr1, $arr2) {        
  $correlation = 0;  
  $k = SumProductMeanDeviation($arr1, $arr2);
  $ssmd1 = SumSquareMeanDeviation($arr1);
  $ssmd2 = SumSquareMeanDeviation($arr2);
  $product = $ssmd1 * $ssmd2;
  $res = sqrt($product);
  $correlation = $k / $res;

  return $correlation;
}

function SumProductMeanDeviation($arr1, $arr2) {
  $sum = 0;
  $num = count($arr1);
  for($i=0; $i < $num; $i++) {
    $sum = $sum + ProductMeanDeviation($arr1, $arr2, $i);
  }
  return $sum;
}

function ProductMeanDeviation($arr1, $arr2, $item) {
  return (MeanDeviation($arr1, $item) * MeanDeviation($arr2, $item));
}

function SumSquareMeanDeviation($arr) {
  $sum = 0;
  $num = count($arr);
  for($i = 0; $i < $num; $i++) {
    $sum = $sum + SquareMeanDeviation($arr, $i);
  }
  return $sum;
}

function SquareMeanDeviation($arr, $item) {
  return MeanDeviation($arr, $item) * MeanDeviation($arr, $item);
}

function SumMeanDeviation($arr) {
  $sum = 0;
  $num = count($arr);
  for($i = 0; $i < $num; $i++) {
    $sum = $sum + MeanDeviation($arr, $i);
  }
  return $sum;
}

function MeanDeviation($arr, $item) {
  $average = Average($arr);
  return $arr[$item] - $average;
}    

function Average($arr) {
  $sum = Sum($arr);
  $num = count($arr);
  return $sum/$num;
}

function Sum($arr) {
  return array_sum($arr);
}
like image 615
Pr0no Avatar asked May 14 '12 22:05

Pr0no


1 Answers

So, here's the php implementation of your algorithm that uses Dawkins' weasel to reduce the error gradually until the desired threshold.

<?php
function sqrMeanDeviation($array, $avg)
{
    $sqrMeanDeviation = 0;
    for($i=0; $i<count($array); $i++)
    {
        $dev = $array[$i] - $avg;
        $sqrMeanDeviation += $dev * $dev;
    }

    return $sqrMeanDeviation;
}

// z values are non-0 an can value between [-abs_z_bound, abs_z_bound]
function random_z_element($abs_z_bound = 1)
{
    $a = (mt_rand() % (2*$abs_z_bound) ) - ($abs_z_bound-1);
    if($a <= 0)
        $a--;
    return $a;
}

// change z a little
function copy_z_weasel($old_array_z, $error_probability = 20 /*error possible is 1 in error_probability*/, $abs_z_bound = 1)
{
    $new_z = array();

    for($i = 0; $i < count($old_array_z); $i++)
        if(mt_rand() % $error_probability == 0 )
            $new_z[$i] = random_z_element($abs_z_bound);
        else
            $new_z[$i] = $old_array_z[$i];

    return $new_z;
}

function correlation_error($array_y, $array_x, $avg_x, $sqrMeanDeviation_x, $correlation)
{
    // checking correlation
    $avg_y = array_sum($array_y)/count($array_y);

    $sqrMeanDeviation_y = 0;
    $covariance_xy = 0;

    for($i=0; $i<count($array_x); $i++)
    {
        $dev_y = $array_y[$i] - $avg_y;
        $sqrMeanDeviation_y += $dev_y * $dev_y;

        $dev_x = $array_x[$i] - $avg_x;
        $covariance_xy += $dev_y * $dev_x;
    }
    $correlation_xy = $covariance_xy/sqrt($sqrMeanDeviation_x*$sqrMeanDeviation_y);
    return abs($correlation_xy - $correlation);
}

function ySeries($array_x, $low_bound, $high_bound, $correlation, $threshold)
{
    $array_y = array();

    $avg_x = array_sum($array_x)/count($array_x);
    $sqrMeanDeviation_x = sqrMeanDeviation($array_x, $avg_x);

    // pre-compute beta
    $beta_x_sQMz = $sqrMeanDeviation_x * sqrt( 1 / ($correlation*$correlation) - 1 );

    $best_array_z = array();
    $n = 0;
    $error = $threshold + 1;

    while($error > $threshold)
    {
        ++$n;

        // generate z 
        $array_z = array();
        if(count($best_array_z) == 0)
            for($i=0; $i<count($array_x); $i++)
                $array_z[$i] = random_z_element();
        else
            $array_z = copy_z_weasel($best_array_z);

        $sqm_z = sqrMeanDeviation($array_z, array_sum($array_z)/count($array_z) );
        // this being 0 implies that for every beta correlation(x,y) = 1 so just give it any random beta
        if($sqm_z)
            $beta = $beta_x_sQMz / $sqm_z;
        else
            $beta = 10;
        // and now we have y
        for($i=0; $i<count($array_x); $i++)
            $array_y[$i] = $array_x[$i] + ($array_z[$i] * $beta);

        // now, change bounds (we could do this afterwards but we want precision and y to be integers)
        // rounding
        $min_y = $array_y[0];
        $max_y = $array_y[0];
        for( $i=1; $i<count($array_x); $i++ )
        {
            if($array_y[$i] < $min_y)
                $min_y = $array_y[$i];
            if($array_y[$i] > $max_y)
                $max_y = $array_y[$i];
        }

        $range = ($high_bound - $low_bound) / ($max_y - $min_y);
        $shift = $low_bound - $min_y;
        for( $i=0; $i<count($array_x); $i++ )
            $array_y[$i] = round($array_y[$i] * $range + $shift);

        // get the error
        $new_error = correlation_error($array_y, $array_x, $avg_x, $sqrMeanDeviation_x, $correlation);

        if($new_error < $error)
        {
            $best_array_z = $array_z;
            $error = $new_error;
        }

    }
    echo "Correlation ", $correlation, " approched within " , $new_error, " in ", $n ," iterations.\n";

    return $array_y;
}

?>
like image 163
Cimbali Avatar answered Oct 25 '22 16:10

Cimbali