First of all, this is more of a math question than a coding one, so please be patient. I am trying to figure out an algorithm that calculates the mean of a set of numbers while ignoring any numbers that are far from the majority of the results. Here is an example of what I am trying to do:
Let's say I have a set of numbers similar to the following:
{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }
It is clear that for the set above, the majority of the numbers lie between 90 and 99, but there are some outliers like { 2, 3, 300, 400 }. I need to calculate the mean of these numbers while ignoring the outliers. I remember reading about something like this in a statistics class, but I can't remember what it was or how to approach the solution.
I'd appreciate any help. Thanks!
Changing the divisor: to see how an outlier affects the mean of a data set, compute the mean with the outlier included, then compute it again with the outlier removed. Removing the outlier decreases the number of data points by one, so you must decrease the divisor accordingly.
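A small worked example (my own numbers, not from the post) showing the recomputation described above: the outlier shifts the mean, and removing it shrinks the divisor from 4 to 3.

```python
data = [90, 91, 92, 300]  # 300 is the outlier

mean_with = sum(data) / len(data)           # (90+91+92+300) / 4 = 143.25
trimmed = [x for x in data if x != 300]     # drop the outlier
mean_without = sum(trimmed) / len(trimmed)  # (90+91+92) / 3 = 91.0

print(mean_with)     # 143.25
print(mean_without)  # 91.0
```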
Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.
In most cases, outliers influence the mean, but not the median or the mode. Their main effect is therefore on the mean. There is no single, universal rule for identifying outliers.
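A quick illustration of the claim above (again my own example): a single large outlier moves the mean substantially, while the median barely moves.

```python
from statistics import mean, median

clean = [90, 91, 92, 95, 99]
with_outlier = clean + [400]  # one extreme value appended

# The mean jumps by ~51; the median shifts by only 1.5.
print(mean(clean), mean(with_outlier))      # 93.4 144.5
print(median(clean), median(with_outlier))  # 92 93.5
```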
What you could do is:

1. Compute the first and third quartiles (Q1, Q3) and the interquartile range, IQR = Q3 - Q1.
2. Discard every value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
3. Take the mean of the remaining values.
PS: Outliers constituting 25% of your dataset is a lot!
PPS: For the second step, we assumed the outliers are "symmetrically distributed", using 4-quantiles (quartiles) and a fence of 1.5 times the interquartile range (IQR) below Q1 and above Q3.
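Here is a minimal sketch of that IQR approach applied to the question's data set, using only the standard library. The 1.5 multiplier and the "inclusive" quantile method are conventional choices on my part, not something the answer mandates:

```python
from statistics import quantiles, mean

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

# Quartiles via linear interpolation ("inclusive" matches numpy's default).
q1, _q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; this drops 2, 3, 300, and 400.
kept = [x for x in data if low <= x <= high]

print(kept)
print(mean(kept))  # roughly 93.09
```

With this data set the fences come out to [82, 106], so exactly the four outliers from the question are discarded before averaging.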