I have a data frame like this:
x
Team 01/01/2012 01/02/2012 01/03/2012 01/01/2012 01/04/2012 SD Mean
A 100 50 40 NA 30 60 80
I would like to compare each cell against the row's mean and SD to find the outliers. For example,
abs(x-Mean) > 3*SD
x$count<-c(1)
(increment this value if the above condition is met).
I am doing this to check for anomalies in my data set. If I knew the column names, the calculation would be easier, but the number of columns will vary. Some cells may contain NA.
I would like to subtract the mean from each cell, and I tried this:
x$diff<-sweep(x, 1, x$Mean, FUN='-')
This does not seem to be working. Any ideas?
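One likely reason the sweep() call fails is that it runs over the whole data frame, including the non-numeric Team column and the SD and Mean columns themselves. Here is a minimal sketch of the check described in the question, restricted to the value columns. The data frame, its column names (d1 to d4), and the second row are made up for illustration; only Team, SD, and Mean are assumed to be fixed names.

```r
# Hypothetical data: row A mirrors the example, row B is invented to trigger the rule
x <- data.frame(
  Team = c("A", "B"),
  d1   = c(100, 10),
  d2   = c(50, 12),
  d3   = c(40, NA),
  d4   = c(30, 500),
  SD   = c(60, 5),
  Mean = c(80, 11)
)

# Treat every column except the identifier and summary columns as a value column,
# so the code does not depend on knowing the date column names in advance
is_value <- !(names(x) %in% c("Team", "SD", "Mean"))
vals <- as.matrix(x[is_value])

# Subtract each row's Mean from that row's values (MARGIN = 1 means by row)
diffs <- sweep(vals, 1, x$Mean, FUN = "-")

# A matrix compared with a vector of length nrow recycles down the columns,
# so row i is compared against 3 * SD[i]
is_outlier <- abs(diffs) > 3 * x$SD

# Count flagged cells per row; NA cells are ignored
x$count <- rowSums(is_outlier, na.rm = TRUE)
```

With this made-up data, row A has no flagged cells and row B has one (the value 500).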
One of the easiest ways to identify outliers in R is by visualizing them in boxplots. Boxplots typically show the median of a dataset along with the first and third quartiles. They also show the limits beyond which all data values are considered as outliers.
Lower range limit = Q1 - (1.5 * IQR), that is, 1.5 times the interquartile range subtracted from the first quartile. Upper range limit = Q3 + (1.5 * IQR), that is, 1.5 times the IQR added to the third quartile. Any data point that falls below the lower limit or above the upper limit is considered an outlier.
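As a rough illustration of the limits described above, base R's boxplot.stats() returns the points a boxplot would draw as outliers. The sample vector is made up. Note that boxplot.stats() works from hinges (via fivenum()) rather than quantile(), so on other data its result can differ slightly from limits computed by hand.

```r
# Made-up sample; 30 is the suspicious value
data <- c(2, 3, 4, 5, 5, 6, 7, 8, 30)

# Limits computed by hand from the first and third quartiles
q1  <- quantile(data, 0.25, names = FALSE)
q3  <- quantile(data, 0.75, names = FALSE)
iqr <- q3 - q1
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr
outliers <- data[data < lower_limit | data > upper_limit]

# The points a boxplot would mark as outliers (default coef = 1.5)
box_outliers <- boxplot.stats(data)$out

# boxplot(data)  # uncomment to draw the plot; outliers appear as individual points
```

For this sample both approaches flag only the value 30.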
An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point located outside the fences (“whiskers”) of the boxplot, that is, more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
Outliers, as the name suggests, are data points that lie far away from the other points in the dataset and can therefore distort its overall distribution. They are usually treated as abnormal values.
Get your IQR (Interquartile range) and lower/upper quartile using:
lowerq = quantile(data)[2]
upperq = quantile(data)[4]
iqr = upperq - lowerq #Or use IQR(data)
Compute the bounds for a mild outlier:
mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)
Any data point outside (> mild.threshold.upper or < mild.threshold.lower) these values is a mild outlier
To detect extreme outliers do the same, but multiply by 3 instead:
extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)
Any data point outside (> extreme.threshold.upper or < extreme.threshold.lower) these values is an extreme outlier
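The mild and extreme thresholds above could be wrapped in a small helper. This is only a sketch: classify_outliers is a hypothetical name, the sample data is made up, and NA inputs simply come back as NA.

```r
# Hypothetical helper: label each value as "none", "mild", or "extreme"
classify_outliers <- function(data) {
  q <- quantile(data, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  lowerq <- q[1]
  upperq <- q[2]
  iqr <- upperq - lowerq

  # Same rules as above: 1.5 * IQR for mild, 3 * IQR for extreme
  mild    <- data > upperq + 1.5 * iqr | data < lowerq - 1.5 * iqr
  extreme <- data > upperq + 3 * iqr   | data < lowerq - 3 * iqr

  # Check extreme first, since every extreme outlier also passes the mild test
  ifelse(extreme, "extreme", ifelse(mild, "mild", "none"))
}

# Made-up sample with one mild (15) and one extreme (30) value
data <- c(2, 3, 4, 5, 5, 6, 7, 8, 15, 30)
kind <- classify_outliers(data)
```

A call such as data[kind != "none"] then returns the flagged values.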
Hope this helps
edit: was accessing 50%, not 75%