
Calculating the mean for a set of numbers while neglecting outliers

Tags:

c++

math

First of all, this is more of a math question than a coding one, so please be patient. I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:

Let's say I have a set of numbers similar to the following:

{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }

It is clear that the majority of the numbers in the set above lie between 90 and 99; however, I have some outliers like { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like this in a statistics class, but I can't remember what it was or how to approach the solution.

I would appreciate any help.

Thanks

Zaid Amir asked Jun 01 '11 11:06





1 Answer

What you could do is:

  1. estimate the percentage of outliers in your data: about 25% (4 of 15 values) for the provided dataset,
  2. compute the appropriate quantiles: 8-quantiles for your dataset, so that the lowest and the highest eighth (12.5% each) can be excluded as outliers,
  3. compute the mean of the values lying between the first and the last quantile.
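
The three steps above amount to what is usually called a trimmed mean, which could be sketched in C++ roughly as follows. This is only a minimal sketch under assumptions: the function name `trimmed_mean` and its `trim_per_side` parameter are made up for illustration, and the number of values dropped from each end is rounded to the nearest integer, which is just one of several reasonable conventions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Mean of the values that remain after dropping a fraction of the sorted
// data from each end. trim_per_side is a hypothetical parameter name:
// 0.125 corresponds to cutting at the first and last 8-quantile.
double trimmed_mean(std::vector<double> values, double trim_per_side) {
    if (values.empty() || trim_per_side < 0.0 || trim_per_side >= 0.5) {
        throw std::invalid_argument("need data and 0 <= trim_per_side < 0.5");
    }
    std::sort(values.begin(), values.end());

    const std::size_t n = values.size();
    // Assumption: round to the nearest whole number of values per side.
    const auto k = static_cast<std::size_t>(
        std::llround(trim_per_side * static_cast<double>(n)));
    if (2 * k >= n) {
        throw std::invalid_argument("trimming would remove every value");
    }

    // Sum the middle n - 2k values, skipping k values at each end.
    const auto offset = static_cast<std::ptrdiff_t>(k);
    const double sum =
        std::accumulate(values.begin() + offset, values.end() - offset, 0.0);
    return sum / static_cast<double>(n - 2 * k);
}
```

With the dataset from the question and `trim_per_side = 0.125`, two values are dropped from each end (2, 3 and 300, 400), leaving 11 values whose mean is 1024/11, about 93.09. Note that this only works well if the outliers really are split evenly between the two ends.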

PS: Outliers constituting 25% of your dataset is a lot!

PPS: For the second step, we assumed outliers are "symmetrically distributed". See the graph below, where we use 4-quantiles and 1.5 times the interquartile range (IQR) from Q1 and Q3:

[image: graph of 4-quantiles with cut-offs at 1.5 times the IQR from Q1 and Q3]
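
The 1.5 times IQR rule mentioned in the PPS could be sketched in C++ along these lines. The names `quantile_sorted` and `mean_within_iqr_fences` are hypothetical, and the quartiles are computed by linear interpolation between neighbouring sorted values, which is an assumption: other quartile conventions exist and can give slightly different cut-offs.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Quantile of already-sorted, non-empty data, using linear interpolation
// between the two nearest order statistics (one convention among several).
double quantile_sorted(const std::vector<double>& sorted, double p) {
    const double pos = p * static_cast<double>(sorted.size() - 1);
    const auto lo = static_cast<std::size_t>(std::floor(pos));
    const auto hi = static_cast<std::size_t>(std::ceil(pos));
    return sorted[lo] + (pos - static_cast<double>(lo)) * (sorted[hi] - sorted[lo]);
}

// Mean of the values inside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the
// multiplier named in the answer.
double mean_within_iqr_fences(std::vector<double> values, double k = 1.5) {
    if (values.empty() || k < 0.0) {
        throw std::invalid_argument("need data and a non-negative multiplier");
    }
    std::sort(values.begin(), values.end());

    const double q1 = quantile_sorted(values, 0.25);
    const double q3 = quantile_sorted(values, 0.75);
    const double iqr = q3 - q1;
    const double lower = q1 - k * iqr;
    const double upper = q3 + k * iqr;

    // Accumulate only the values that fall inside the fences. At least the
    // middle of the data lies between Q1 and Q3, so count is never zero.
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values) {
        if (v >= lower && v <= upper) {
            sum += v;
            ++count;
        }
    }
    return sum / static_cast<double>(count);
}
```

For the dataset in the question, this convention gives Q1 = 91 and Q3 = 97, so the cut-offs are 82 and 106; the values 2, 3, 300 and 400 fall outside them and the mean of the remaining 11 values is again 1024/11, about 93.09. Unlike the quantile-based trimming, this rule does not require the outliers to be split evenly between the low and high ends.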

Wok answered Sep 29 '22 22:09
