
Calculate average without being thrown by strays

I am trying to calculate an average without being thrown off by a small set of far-off numbers. For example, in 1,2,1,2,3,4,50 the single 50 throws off the entire average.

If I have a list of numbers like so:

19,20,21,21,22,30,60,60

The average is about 31.6

The median is 30

The mode is 21 & 60 (averaged to 40.5)

But anyone can see that the majority is in the range 19-22 (5 in, 3 out), and if you take the average of just that range it's 20.6 (very different from any of the numbers above).

I am thinking that you can get this like so:

c+d-r

Where c is the count of numbers, d is the number of distinct values, and r is the range (largest minus smallest). Then you can apply this to all the possible ranges, and the range with the highest score is the optimal one to take an average from.

For example 19,20,21,21,22 would be 5 numbers, 4 distinct values, and the range is 3 (22 - 19). If you plug this into my equation you get 5+4-3=6

If you applied this to the entire number list it would be 8+6-41=-27

I think this works pretty well, but I have to create a huge loop to test against all possible ranges. In just my small example there are 21 possible ranges:

19-19, 19-20, 19-21, 19-22, 19-30, 19-60, 20-20, 20-21, 20-22, 20-30, 20-60, 21-21, 21-22, 21-30, 21-60, 22-22, 22-30, 22-60, 30-30, 30-60, 60-60
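For reference, the brute-force loop over all candidate ranges could be sketched roughly as follows. This is only a minimal sketch: it assumes integer input, the function name best_range_average is made up for illustration, and ties are resolved by keeping the first best range found.

```php
<?php
// Score every range [lo, hi] built from the distinct values with c + d - r
// and return the best-scoring range together with its average.
// Assumes a non-empty list of integers (array_count_values needs ints or strings).
function best_range_average(array $numbers): array {
    $counts = array_count_values($numbers); // distinct value => occurrences
    ksort($counts);                         // distinct values in ascending order
    $values = array_keys($counts);
    $best = null;
    for ($i = 0; $i < count($values); $i++) {
        $c = 0;   // count of numbers inside the current range
        $sum = 0; // sum of those numbers
        for ($j = $i; $j < count($values); $j++) {
            $c += $counts[$values[$j]];
            $sum += $values[$j] * $counts[$values[$j]];
            $d = $j - $i + 1;               // distinct values in the range
            $r = $values[$j] - $values[$i]; // width of the range
            $score = $c + $d - $r;
            if ($best === null || $score > $best['score']) {
                $best = array(
                    'lo' => $values[$i],
                    'hi' => $values[$j],
                    'score' => $score,
                    'average' => $sum / $c,
                );
            }
        }
    }
    return $best;
}

$best = best_range_average(array(19, 20, 21, 21, 22, 30, 60, 60));
// $best describes the range 19-22 with score 6 and average 20.6
```

Keeping running totals for c and the sum means each of the 21 ranges costs constant work, but the number of ranges still grows quadratically with the number of distinct values.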

I am wondering if there is a more efficient way to get an average like this.

Or does someone have a better algorithm altogether?

JD Isaacks asked Oct 19 '10 19:10



3 Answers

You might get some use out of standard deviation here, which basically measures how concentrated the data points are. You can define an outlier as anything more than 1 standard deviation (or whatever other number suits you) from the average, throw them out, and calculate a new average that doesn't include them.
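A minimal sketch of that idea, assuming a non-empty list and the population standard deviation (dividing by n); the function name average_within_stddev and the parameter $k are hypothetical:

```php
<?php
// Average of the values lying within $k standard deviations of the mean.
// Assumes $numbers is non-empty.
function average_within_stddev(array $numbers, float $k = 1.0): float {
    $n = count($numbers);
    $mean = array_sum($numbers) / $n;
    $squares = 0.0;
    foreach ($numbers as $v) {
        $squares += ($v - $mean) ** 2;
    }
    $sd = sqrt($squares / $n); // population standard deviation
    $kept = array_filter($numbers, function ($v) use ($mean, $sd, $k) {
        return abs($v - $mean) <= $k * $sd;
    });
    if (count($kept) === 0) {
        return $mean; // nothing survived (very small $k): fall back to the plain mean
    }
    return array_sum($kept) / count($kept);
}

// Both 60s are dropped; the remaining six values average about 22.17.
$avg = average_within_stddev(array(19, 20, 21, 21, 22, 30, 60, 60));
```

Note that the outliers themselves inflate both the mean and the standard deviation, so the cutoff you pick matters; here the 30 still survives with a cutoff of 1.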

grossvogel answered Nov 01 '22 11:11


Here's a pretty naive implementation that you could fix up for your own needs. I purposely kept it pretty verbose. It's based on the five-number summary (minimum, lower quartile, median, upper quartile, maximum) often used to figure these things out.

// Return the median of a non-empty numeric array.
function get_median($arr) {
    sort($arr);
    $n = count($arr);
    $mid = intdiv($n, 2);
    if ($n % 2 === 0) {
        // Even count: average the two middle values.
        return ($arr[$mid - 1] + $arr[$mid]) / 2;
    }
    // Odd count: the single middle value.
    return $arr[$mid];
}

// Return array(min, Q1, median, Q3, max). Needs at least two values.
function get_five_number_summary($arr) {
    sort($arr);
    $n = count($arr);
    $half = intdiv($n, 2);
    // Q1 and Q3 are the medians of the lower and upper halves;
    // for an odd count the middle value belongs to neither half.
    $lower_half = array_slice($arr, 0, $half);
    $upper_half = array_slice($arr, $n - $half);
    return array($arr[0], get_median($lower_half), get_median($arr), get_median($upper_half), $arr[$n - 1]);
}

// Print every value lying more than $k interquartile ranges outside the quartiles.
// 1.5 is the conventional multiplier.
function find_outliers($arr, $k = 1.5) {
    $fns = get_five_number_summary($arr);
    $interquartile_range = $fns[3] - $fns[1];
    $low = $fns[1] - $k * $interquartile_range;
    $high = $fns[3] + $k * $interquartile_range;
    foreach ($arr as $v) {
        if ($v > $high || $v < $low) {
            echo "$v is an outlier<br>";
        }
    }
}

//$numbers = array( 19,20,21,21,22,30,60 ); // 60 is an outlier
$numbers = array( 1,230,239,331,340,800); // 1 is an outlier, 800 is an outlier
find_outliers($numbers);

Note that this method, albeit much simpler to implement than standard deviation, will not find the two 60 outliers in your example, but it works pretty well. Use the code for whatever, hopefully it's useful!

To see how the algorithm works and how I implemented it, go to: http://www.mathwords.com/o/outlier.htm

This, of course, doesn't calculate the final average, but it's kind of trivial after you run find_outliers() :P
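That last step could look roughly like this. It is only a sketch under a few assumptions: at least two values, quartiles taken as the medians of the lower and upper halves, and fences at Q1 - k*IQR and Q3 + k*IQR with the conventional k = 1.5. The names sorted_median and average_without_outliers are made up for illustration.

```php
<?php
// Median of an already sorted, non-empty array.
function sorted_median(array $sorted) {
    $n = count($sorted);
    $mid = intdiv($n, 2);
    return $n % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

// Average of the values inside the fences Q1 - k*IQR and Q3 + k*IQR.
// Assumes at least two values.
function average_without_outliers(array $numbers, float $k = 1.5): float {
    sort($numbers);
    $n = count($numbers);
    $half = intdiv($n, 2);
    $q1 = sorted_median(array_slice($numbers, 0, $half));   // lower quartile
    $q3 = sorted_median(array_slice($numbers, $n - $half)); // upper quartile
    $iqr = $q3 - $q1;
    $kept = array_filter($numbers, function ($v) use ($q1, $q3, $iqr, $k) {
        return $v >= $q1 - $k * $iqr && $v <= $q3 + $k * $iqr;
    });
    return array_sum($kept) / count($kept);
}

// 1 and 800 are dropped; 230, 239, 331 and 340 average to 285.
$avg = average_without_outliers(array(1, 230, 239, 331, 340, 800));
```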

David Titarenco answered Nov 01 '22 10:11


Why don't you use the median? It's not 30, it's 21.5.

Mike C answered Nov 01 '22 11:11