Suppose I sample a selection of database records that return the following numbers:
20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77
Is there an algorithm that can be efficiently implemented in PHP to find the outliers (if there are any) from an array of floats based on how far they deviate from the mean?
OK, let's assume you have your data points in an array like so:
<?php $dataset = array(20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77); ?>
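Then you can use the following function (see the comments for what is happening) to remove all numbers that fall outside the mean +/- the standard deviation multiplied by a magnitude you set (defaults to 1). It uses the population standard deviation (dividing by the count), and array_filter keeps the original array keys: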
<?php
function remove_outliers($dataset, $magnitude = 1) {
    $count = count($dataset);
    if ($count === 0) {
        return $dataset; // Nothing to filter; also avoids division by zero
    }
    $mean = array_sum($dataset) / $count; // Calculate the mean
    // Population standard deviation, multiplied by the magnitude
    $squares = array_map("sd_square", $dataset, array_fill(0, $count, $mean));
    $deviation = sqrt(array_sum($squares) / $count) * $magnitude;
    // Keep only values within $mean +/- $deviation (original keys are preserved)
    return array_filter($dataset, function($x) use ($mean, $deviation) {
        return ($x <= $mean + $deviation && $x >= $mean - $deviation);
    });
}

// Squared distance of a value from the mean
function sd_square($x, $mean) {
    return pow($x - $mean, 2);
}
?>
For your example this function returns the following with a magnitude of 1:
Array
(
    [1] => 80.3
    [2] => 70.95
    [5] => 85.56
    [6] => 69.77
)
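To see why those four values survive, it can help to look at each value's z-score, that is, its distance from the mean measured in standard deviations. The following is only an illustrative sketch: the variable names ($zScores, $removed) are made up for this example, and it assumes the same population standard deviation used by the function above.

```php
<?php
// Illustrative sketch: compute the z-score of every value in the example data.
$dataset = array(20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77);

$count = count($dataset);
$mean = array_sum($dataset) / $count; // roughly 63.19

// Sum of squared distances from the mean
$sumSquares = 0.0;
foreach ($dataset as $x) {
    $sumSquares += pow($x - $mean, 2);
}
$sd = sqrt($sumSquares / $count); // population standard deviation, roughly 30.17

// z-score per key: how many standard deviations the value is from the mean
$zScores = array();
foreach ($dataset as $key => $x) {
    $zScores[$key] = ($x - $mean) / $sd;
}

// Keys whose |z| exceeds 1 are the ones a magnitude of 1 removes
$removed = array_keys(array_filter($zScores, function ($z) {
    return abs($z) > 1;
}));

print_r($removed); // keys 0, 3 and 4 (20.50, 15.25 and 99.97)
```

Since the surviving keys are not contiguous, wrapping the result of remove_outliers() in array_values() is one way to re-index it if you need a plain list.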
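For a normally distributed set of data, the following function removes values more than 3 standard deviations from the mean. Note that stats_standard_deviation() comes from the PECL stats extension, so that extension must be installed: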
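<?php
function remove_outliers($array) {
    if (count($array) == 0) {
        return $array;
    }
    $ret = array();
    $mean = array_sum($array) / count($array);
    $stddev = stats_standard_deviation($array); // Requires the PECL stats extension
    $outlier = 3 * $stddev; // Maximum allowed distance from the mean
    foreach ($array as $a) {
        // Keep the value unless it lies more than 3 standard deviations from the mean
        if (abs($a - $mean) <= $outlier) {
            $ret[] = $a;
        }
    }
    return $ret;
}
?>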
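If the stats extension is not available, the same idea can be written with core PHP only. This is a minimal sketch, not a drop-in replacement: the function name remove_outliers_sigma and its $sigmas parameter are hypothetical, and it assumes the population standard deviation (dividing by the count), which as far as I know is also the default of stats_standard_deviation().

```php
<?php
// Hypothetical helper: the name and the $sigmas parameter are illustrative only.
function remove_outliers_sigma(array $values, $sigmas = 3.0) {
    $count = count($values);
    if ($count === 0) {
        return $values; // Nothing to filter; avoids division by zero
    }

    $mean = array_sum($values) / $count;

    // Sum of squared distances from the mean
    $sumSquares = 0.0;
    foreach ($values as $v) {
        $sumSquares += ($v - $mean) * ($v - $mean);
    }

    // Population standard deviation times the chosen threshold
    $limit = $sigmas * sqrt($sumSquares / $count);

    // Keep values whose distance from the mean is within the limit
    $kept = array();
    foreach ($values as $v) {
        if (abs($v - $mean) <= $limit) {
            $kept[] = $v;
        }
    }
    return $kept;
}

$dataset = array(20.50, 80.30, 70.95, 15.25, 99.97, 85.56, 69.77);
$atThree = remove_outliers_sigma($dataset);      // default threshold of 3
$atOne = remove_outliers_sigma($dataset, 1.0);   // stricter threshold of 1

echo count($atThree), " values kept at 3 sigma\n"; // 7
echo count($atOne), " values kept at 1 sigma\n";   // 4
```

For the seven example values nothing is removed at 3 standard deviations, because none of them lies that far from the mean. A threshold of 3 is better suited to larger samples, while a smaller threshold is needed to flag anything in a sample this small.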