Returning the middle n (values not index) from a collection

Tags: c#, algorithm, linq

I have a List<int> and I need to remove the outliers, so I want to use an approach where I take only the middle n. I want the middle in terms of values, not index.

For instance, given the following list, if I wanted the middle 80% I would expect 11 and 100 to be removed.

11,22,22,33,44,44,55,55,55,100.

Is there an easy or built-in way to do this in LINQ?

asked Apr 18 '11 16:04 by will


1 Answer

I have a List<int> and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

Removing outliers correctly depends entirely on the statistical model that accurately describes the distribution of the data -- which you have not supplied for us.

On the assumption that it is a normal (Gaussian) distribution, here's what you want to do.

First compute the mean. That's easy; it's just the sum divided by the number of items.

Second, compute the standard deviation. Standard deviation is a measure of how "spread out" the data is around the mean. Compute it by:

  • take the difference of each point from the mean
  • square the difference
  • take the mean of the squares -- this is the variance
  • take the square root of the variance -- this is the standard deviation
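
The steps above can be sketched in C# with LINQ roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `Stats` and its method names are made up for illustration, and it computes the population standard deviation (dividing by the number of items), which is what "the mean of the squares" describes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class Stats
{
    // Mean: the sum divided by the number of items.
    public static double Mean(IReadOnlyCollection<int> values)
    {
        if (values.Count == 0)
            throw new ArgumentException("Sequence is empty.", nameof(values));
        return values.Average();
    }

    // Population standard deviation, following the four steps in the list.
    public static double StandardDeviation(IReadOnlyCollection<int> values)
    {
        double mean = Mean(values);
        double variance = values
            .Select(v => v - mean)   // difference of each point from the mean
            .Select(d => d * d)      // square the difference
            .Average();              // mean of the squares: the variance
        return Math.Sqrt(variance);  // square root: the standard deviation
    }
}
```

For the list in the question (11, 22, 22, 33, 44, 44, 55, 55, 55, 100), `Stats.Mean` gives 44.1 and `Stats.StandardDeviation` gives about 23.8.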

In a normal distribution, about 80% of the items are within roughly 1.28 standard deviations of the mean. So, for example, suppose the mean is 50 and the standard deviation is 20. You would expect about 80% of the sample to fall between 50 - 1.28 * 20 = 24.4 and 50 + 1.28 * 20 = 75.6. You can then filter out the items in the list that fall outside that range.
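
A filter along those lines might look like the sketch below. It assumes the data really is approximately normal; the class `NormalFilter`, the method `WithinStandardDeviations`, and the parameter `k` (the number of standard deviations to keep on each side of the mean) are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; the names are illustrative only.
public static class NormalFilter
{
    // Keeps the values that lie within k population standard deviations of the mean.
    public static List<int> WithinStandardDeviations(IReadOnlyCollection<int> values, double k)
    {
        if (values.Count == 0)
            return new List<int>();

        double mean = values.Average();
        // Population variance: the mean of the squared differences from the mean.
        double sd = Math.Sqrt(values.Average(v => (v - mean) * (v - mean)));

        double lower = mean - k * sd;
        double upper = mean + k * sd;
        return values.Where(v => v >= lower && v <= upper).ToList();
    }
}
```

With the list from the question and `k` anywhere between 1.2 and 1.3, the range works out to roughly 14 to 75, so 11 and 100 are dropped and the other eight values are kept.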

Note, however, that this is not removing "outliers". This is removing elements that are more than about 1.28 standard deviations from the mean, in order to get an 80% interval around the mean. In a normal distribution one expects to see "outliers" on a regular basis. 99.73% of items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations from the mean! In fact, anywhere up to, say, five observations more than three standard deviations away from the mean when given a thousand observations probably does not indicate an outlier.
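
The arithmetic behind that claim can be checked with a short sketch. It assumes the thousand observations are independent draws from a normal distribution, so the number of them beyond three standard deviations follows a binomial distribution with p = 1 - 0.9973 = 0.0027; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper class; the names are illustrative only.
public static class TailCount
{
    // Expected number of observations beyond the cutoff: n * p.
    public static double Expected(int n, double tailProbability) => n * tailProbability;

    // P(X <= k) for X ~ Binomial(n, p), assuming 0 <= p < 1 and 0 <= k <= n.
    public static double BinomialCdf(int n, double p, int k)
    {
        double term = Math.Pow(1 - p, n); // P(X = 0)
        double sum = term;
        for (int i = 1; i <= k; i++)
        {
            // P(X = i) = P(X = i - 1) * (n - i + 1) / i * p / (1 - p)
            term *= (double)(n - i + 1) / i * p / (1 - p);
            sum += term;
        }
        return sum;
    }
}
```

`TailCount.Expected(1000, 1 - 0.9973)` is about 2.7, matching "two or three", and `TailCount.BinomialCdf(1000, 0.0027, 5)` comes out at roughly 0.94, so five or fewer such observations is the usual case under these assumptions.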

I think you need to define very carefully what you mean by "outlier" and describe why you are attempting to eliminate them. Things that look like outliers are potentially not outliers at all; they may be real data that you should be paying attention to.

Also, note that none of this analysis is correct if the assumption of a normal distribution is incorrect! You can get into big, big trouble eliminating what look like outliers when in fact you have the entire statistical model wrong. If the model is more heavy-tailed than the normal distribution, then extreme values are common and are not actually outliers. Be careful! If your distribution is not normal then you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.

answered Nov 11 '22 23:11 by Eric Lippert