What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have a set of 200 data rows (so a small data set). I want to carry out some statistical analysis, but before that I want to exclude outliers.

What are suitable algorithms for this purpose? Accuracy is a major concern.

I am very new to statistics, so I need help with very basic algorithms.

asked Jan 15 '10 by Ashish Agarwal

People also ask

Which is the best method for removing outliers in a data set?

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences.

Which algorithm can handle outliers?

In this article, we have seen 3 different methods for dealing with outliers: the univariate method, the multivariate method, and the Minkowski error. These methods are complementary, and we might need to try them all if our data set has many severe outliers.

Which of the following measure can handle outliers efficiently?

Fitting data with Least Absolute Deviations (the L1-norm method) is much more robust to outliers than methods based on Least Squares.
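As a minimal illustration of that point (not part of the original page), the sketch below fits a line by ordinary least squares and by least absolute deviations on data containing a single severe outlier. The synthetic data and the use of scipy.optimize.minimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[25] += 40.0                       # inject one severe outlier

# Ordinary least squares (L2): minimizes the sum of squared residuals.
slope_l2, intercept_l2 = np.polyfit(x, y, deg=1)

# Least absolute deviations (L1): minimizes the sum of |residuals|.
def l1_loss(params):
    slope, intercept = params
    return np.abs(y - (slope * x + intercept)).sum()

slope_l1, intercept_l1 = minimize(l1_loss, x0=[1.0, 0.0], method="Nelder-Mead").x

print(f"L2 fit: slope={slope_l2:.2f}, intercept={intercept_l2:.2f}")  # pulled toward the outlier
print(f"L1 fit: slope={slope_l1:.2f}, intercept={intercept_l1:.2f}")  # close to the true 2.0 and 1.0
```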


2 Answers

Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:

  1. A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
  2. The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.

There are a few good ways to proceed:

  1. Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.

  2. Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means capping the bottom x% at the xth percentile and the top x% at the (100 - x)th percentile (see the sketch after this list).

  3. If you have a small dataset, you could just plot your data and examine it manually for implausible values.

  4. If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
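Here is a minimal sketch of options 2 and 4 above (not part of the original answer), assuming NumPy and a synthetic ~200-row dataset; the 5% clip fraction and the 3-MAD cutoff are example choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, 195), [120, 130, -40, 150, 200]])  # ~200 rows, a few outliers

# Option 2: winsorize by clipping to the 5th and 95th percentiles.
lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)

# Option 4: keep only points within 3 median absolute deviations of the median.
median = np.median(data)
mad = np.median(np.abs(data - median))
filtered = data[np.abs(data - median) <= 3 * mad]

print(f"original:     n={data.size}, mean={data.mean():.1f}")
print(f"winsorized:   n={winsorized.size}, mean={winsorized.mean():.1f}")
print(f"MAD-filtered: n={filtered.size}, mean={filtered.mean():.1f}")
```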

answered Nov 27 '22 by dsimcha

Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).

Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like, using the formula from mtsu.edu (the original link is dead; it is archived on archive.org).
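A minimal sketch (not part of the original answer) of getting Cook's D from statsmodels rather than by hand; the synthetic regression data and the 4/n flagging threshold are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)
y[10] += 15.0                                  # one high-influence point

X = sm.add_constant(x)                         # add intercept column
results = sm.OLS(y, X).fit()

# By hand this would be D_i = (resid_i**2 / (p * MSE)) * h_ii / (1 - h_ii)**2,
# with p model parameters and leverage h_ii; statsmodels computes it directly.
cooks_d, _ = results.get_influence().cooks_distance
flagged = np.where(cooks_d > 4 / len(y))[0]    # common rule-of-thumb cutoff
print("points flagged by Cook's D:", flagged)
```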

answered Nov 27 '22 by eric.a.booth