Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

calculating the outliers in R

Tags:

r

I have a data frame like this:

x

Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD Mean
A     100         50           40        NA         30       60  80

I like to perform calculation on each cell to the mean and sd to calculate the outliers. For example,

abs(x-Mean) > 3*SD

x$count<-c(1) (increment this value if the above condition is met).

I am doing this to check the anomaly in my data set. If I know the column names, it would be easier to do the calculations, but number of columns will vary. Some cells may have NA in them.

I like to subtrack mean from each cell, and I tried this

x$diff<-sweep(x, 1, x$Mean, FUN='-')

does not seem to be working, any ideas?

like image 484
user1471980 Avatar asked Oct 12 '12 19:10

user1471980


People also ask

How do you find outliers in R programming?

One of the easiest ways to identify outliers in R is by visualizing them in boxplots. Boxplots typically show the median of a dataset along with the first and third quartiles. They also show the limits beyond which all data values are considered as outliers.

What is the formula to find outliers?

Lower range limit = Q1 – (1.5* IQR). Essentially this is 1.5 times the inner quartile range subtracting from your 1st quartile. Higher range limit = Q3 + (1.5*IQR) This is 1.5 times IQR+ quartile 3. Now if any of your data falls below or above these limits, it will be considered an outlier.

How does R boxplot determine outliers?

An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point that is located outside the fences (“whiskers”) of the boxplot (e.g: outside 1.5 times the interquartile range above the upper quartile and bellow the lower quartile).

What does outlier mean in R?

Outliers, as the name suggests, are the data points that lie away from the other points of the dataset. That is the data values that appear away from other data values and hence disturb the overall distribution of the dataset. This is usually assumed as an abnormal distribution of the data values.


1 Answers

Get your IQR (Interquartile range) and lower/upper quartile using:

lowerq = quantile(data)[2]
upperq = quantile(data)[4]
iqr = upperq - lowerq #Or use IQR(data)

Compute the bounds for a mild outlier:

mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)

Any data point outside (> mild.threshold.upper or < mild.threshold.lower) these values is a mild outlier

To detect extreme outliers do the same, but multiply by 3 instead:

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

Any data point outside (> extreme.threshold.upper or < extreme.threshold.lower) these values is an extreme outlier

Hope this helps

edit: was accessing 50%, not 75%

like image 114
Omar Wagih Avatar answered Oct 20 '22 23:10

Omar Wagih