 

Removing univariate outliers from data frame (+-3 SDs)

Tags: r, outliers

I'm so new to R that I'm having trouble finding what I need in other people's questions. I think my question is so easy that nobody else has bothered to ask it.

What would be the simplest code to create a new data frame that excludes univariate outliers on a certain variable? I'm defining an outlier as a point more than 3 SDs from its condition's mean, computed within that condition.

I'm embarrassed to show what I've tried, but here it is:

greaterthan <- mean(dat$var2[dat$condition=="one"]) + 
               2.5*(sd(dat$var2[dat$condition=="one"]))
lessthan    <- mean(dat$var2[dat$condition=="one"]) -
               2.5*(sd(dat$var2[dat$condition=="one"]))   

withoutliersremovedone1 <-dat$var2[dat$condition=="one"] < greaterthan

and I'm pretty much already stuck there.

Thanks

luke123 asked Dec 20 '22 10:12


1 Answer

> dat <- data.frame(
                    var1=sample(letters[1:2],10,replace=TRUE),
                    var2=c(1,2,3,1,2,3,102,3,1,2)
                   )
> dat
   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3
7     a  102 #outlier
8     b    3
9     b    1
10    a    2

Now return only those rows whose var2 value is not (!) more than 2 SDs from the mean of var2, in either direction. Change 2 to however many SDs you want as the cutoff.

> dat[!(abs(dat$var2 - mean(dat$var2)) / sd(dat$var2) > 2), ]
   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3
8     b    3 # row 7 (the outlier) is gone
9     b    1
10    a    2
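
To see why this example uses a cutoff of 2 and not the 3 from the question, it may help to look at the z-scores directly. This is a minimal sketch that rebuilds only the var2 column from the example above; the names var2 and z are just for illustration.

```r
# var2 values copied from the example data frame above
var2 <- c(1, 2, 3, 1, 2, 3, 102, 3, 1, 2)

# z-score: distance from the mean in units of the sample SD
z <- (var2 - mean(var2)) / sd(var2)

round(z[7], 2)      # z-score of the value 102: 2.85
which(abs(z) > 2)   # flagged with a cutoff of 2: element 7
which(abs(z) > 3)   # flagged with a cutoff of 3: nothing (integer(0))
```

With only 10 observations, no single point can be more than (n-1)/sqrt(n), about 2.85, sample SDs from the mean, so a 3 SD cutoff would keep the 102 in this toy data. On larger samples a cutoff of 3 can flag points as intended.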

Or, more concisely, using the scale function:

dat[!abs(scale(dat$var2)) > 2,]

   var1 var2
1     b    1
2     a    2
3     a    3
4     a    1
5     b    2
6     b    3
8     b    3
9     b    1
10    a    2

edit

This can be extended to work within groups using by:

do.call(rbind,by(dat,dat$var1,function(x) x[!abs(scale(x$var2)) > 2,] ))

This assumes dat$var1 is your variable defining the group each row belongs to.
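
As an alternative to by, the within-group z-scores could also be computed with ave, which returns a vector aligned with the original rows, so the filtered data frame keeps its original row order and row names. This is only a sketch, not the method used above: the data frame is made up for illustration, and it borrows the column names condition and var2 and the 3 SD cutoff from the question.

```r
# Hypothetical data: two conditions with 20 values each
dat <- data.frame(
  condition = rep(c("one", "two"), each = 20),
  var2      = c(rep(c(1, 2, 3, 2), 5), rep(c(10, 11, 12, 11), 5))
)
dat$var2[7] <- 102   # planted outlier in condition "one"

# z-score of every row relative to its own condition's mean and SD
z <- ave(dat$var2, dat$condition,
         FUN = function(x) (x - mean(x)) / sd(x))

# Keep rows that lie within 3 SDs of their condition's mean
dat_clean <- dat[abs(z) <= 3, ]

nrow(dat)        # 40
nrow(dat_clean)  # 39, row 7 has been dropped
```

If a condition has only one observation, its SD is NA and so is its z-score, and those rows would come back filled with NA. Wrapping the test in which(), as in dat[which(abs(z) <= 3), ], drops such rows instead.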

thelatemail answered Jan 21 '23 05:01