
Calculating outliers in R

Q: How do you find outliers in R programming?

One of the easiest ways to identify outliers in R is to visualize them in boxplots. A boxplot typically shows the median of a dataset along with the first and third quartiles. It also shows the limits beyond which data values are considered outliers.

Q: What is the formula to find outliers?

Lower limit = Q1 - (1.5 * IQR), that is, 1.5 times the interquartile range subtracted from the first quartile. Upper limit = Q3 + (1.5 * IQR), that is, 1.5 times the interquartile range added to the third quartile. Any value that falls below the lower limit or above the upper limit is considered an outlier.
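As a worked illustration of the formula above, here is a minimal sketch on a small made-up vector. The sample numbers and variable names are assumptions for the example only, and quantile() is used with its default method.

```r
# Hypothetical sample data; 100 is the suspicious value.
values <- c(30, 40, 50, 45, 35, 100)

# First and third quartiles (default quantile method).
q1  <- quantile(values, 0.25, names = FALSE)
q3  <- quantile(values, 0.75, names = FALSE)
iqr <- q3 - q1   # equivalent to IQR(values)

# Limits from the formula: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Values outside the limits are treated as outliers.
outliers <- values[values < lower_limit | values > upper_limit]
print(outliers)
```

For this vector the quartiles are 36.25 and 48.75, so the limits are 17.5 and 67.5, and only 100 falls outside them.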

Q: How does R boxplot determine outliers?

An outlier is an observation that is numerically distant from the rest of the data. In a boxplot, an outlier is a data point located outside the fences ("whiskers") of the plot, for example more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
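The points a boxplot would draw as outliers can also be obtained without plotting. The sketch below uses boxplot.stats() from base R on made-up data; note that it works from the hinges of Tukey's five-number summary, which can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical sample data, for illustration only.
values <- c(30, 40, 50, 45, 35, 100)

# coef = 1.5 is the default whisker length in units of the box height.
stats <- boxplot.stats(values, coef = 1.5)

# $stats holds lower whisker, lower hinge, median, upper hinge, upper whisker.
print(stats$stats)

# $out holds the points lying beyond the whiskers.
print(stats$out)
```

Calling boxplot(values) draws the same points as individual marks beyond the whiskers.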

Q: What does outlier mean in R?

Outliers, as the name suggests, are data points that lie far from the other points in the dataset. Because they sit apart from the other values, they can distort the overall distribution of the dataset. They are usually treated as abnormal values.

Tags:

r

I have a data frame like this:

Team 01/01/2012  01/02/2012  01/03/2012  01/01/2012 01/04/2012 SD Mean
A     100         50           40        NA         30       60  80

I would like to compare each cell against the mean and SD to find the outliers. For example,

abs(x-Mean) > 3*SD

x$count<-c(1) (increment this value if the above condition is met).

I am doing this to check for anomalies in my data set. If I knew the column names, the calculation would be easier, but the number of columns will vary. Some cells may contain NA.

I would like to subtract the mean from each cell, and I tried this:

x$diff<-sweep(x, 1, x$Mean, FUN='-')

It does not seem to be working. Any ideas?

484

asked Oct 12 '12 19:10

user1471980

1 Answer

Get your IQR (Interquartile range) and lower/upper quartile using:

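# quantile() returns the 0%, 25%, 50%, 75% and 100% points; na.rm = TRUE skips NA cells
lowerq = quantile(data, na.rm = TRUE)[2]  # 25% (lower quartile)
upperq = quantile(data, na.rm = TRUE)[4]  # 75% (upper quartile)
iqr = upperq - lowerq  # or use IQR(data, na.rm = TRUE)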

Compute the bounds for a mild outlier:

mild.threshold.upper = (iqr * 1.5) + upperq
mild.threshold.lower = lowerq - (iqr * 1.5)

Any data point outside these values (> mild.threshold.upper or < mild.threshold.lower) is a mild outlier.

To detect extreme outliers do the same, but multiply by 3 instead:

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

Any data point outside these values (> extreme.threshold.upper or < extreme.threshold.lower) is an extreme outlier.
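The mild and extreme thresholds can be wrapped into one helper that labels every value. This is only a sketch: classify_outliers is a hypothetical name, not a base R function, and the sample vector is made up. NA values are skipped when computing the quartiles and stay NA in the result.

```r
# classify_outliers is a hypothetical helper, not part of base R.
classify_outliers <- function(data) {
  # 25% and 75% quantiles, ignoring missing values.
  q   <- quantile(data, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]

  # Mild: beyond 1.5 * IQR from the quartiles; extreme: beyond 3 * IQR.
  mild    <- data < q[1] - 1.5 * iqr | data > q[2] + 1.5 * iqr
  extreme <- data < q[1] - 3 * iqr   | data > q[2] + 3 * iqr

  # Extreme takes precedence over mild; NA inputs remain NA.
  ifelse(extreme, "extreme", ifelse(mild, "mild", "none"))
}

# Made-up data with one missing value.
labels <- classify_outliers(c(30, 40, 50, 45, 35, 60, NA, 100))
print(labels)
```

The labels could then be tabulated with table(labels) to count how many values fall into each class.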

Hope this helps

edit: was accessing 50%, not 75%
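For the row-wise check described in the question (abs(x - Mean) > 3 * SD), one possible approach is sketched below. The sweep() call in the question probably fails because x still contains the non-numeric Team column as well as the SD and Mean columns. The data frame here is a made-up reconstruction, and the column names d1 to d4 are placeholders for the date columns.

```r
# Hypothetical reconstruction of the data frame; d1..d4 stand in for the dates.
x <- data.frame(
  Team = c("A", "B"),
  d1 = c(100, 10), d2 = c(50, 12), d3 = c(40, NA), d4 = c(30, 500),
  SD = c(60, 5), Mean = c(80, 11),
  stringsAsFactors = FALSE
)

# Every column except Team, SD and Mean is a measurement column,
# so the code does not depend on how many date columns exist.
value_cols <- setdiff(names(x), c("Team", "SD", "Mean"))
vals <- as.matrix(x[value_cols])

# Subtract each row's Mean from the cells of that row (MARGIN = 1 is rows).
diffs <- sweep(vals, 1, x$Mean, FUN = "-")

# 3 * x$SD has one entry per row and is recycled down the columns,
# so each cell in row i is compared against 3 * SD[i].
flagged <- abs(diffs) > 3 * x$SD

# Count flagged cells per row; NA cells are ignored.
x$count <- rowSums(flagged, na.rm = TRUE)
print(x)
```

Storing only the per-row count keeps x a plain data frame; the flagged matrix can be inspected separately to see which cells tripped the check.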

114

answered Oct 20 '22 23:10

Omar Wagih
