Search for and remove outliers from a dataframe grouped by a variable

I have a data frame that has 5 variables and 800 rows:

head(df)
       V1 variable    value element OtolithNum
1 24.9835       V7 130230.0      Mg         25
2 24.9835       V8 145844.0      Mg         25
3 24.9835       V9 126126.0      Mg         25
4 24.9835      V10 103152.0      Mg         25
5 24.9835      V11 129571.9      Mg         25
6 24.9835      V12 114214.0      Mg         25

I need to perform the following:

  1. Identify all values (from the "value" variable) that are more than 2 standard deviations from the median, grouped by the "element" variable.
  2. Remove the outliers from the data frame (or create a new data frame with the outliers excluded).

I have been using the dplyr package and have used the following code to group by the "element" variable and compute the mean values:

df1=df %>%
  group_by(element) %>%
  summarise_each(funs(mean), value)

Can you please help me modify or add to the code above to remove outliers (defined above as more than 2 sd from the median), grouped by the "element" variable, before I extract the means?

I have tried the following code from another post (that's why the data names don't match my data above), without luck:

#standardize each column (we use it in the outdet function)
   scale(dat)
#create function that looks for values > +/- 2 sd from mean
   outdet <- function(x) abs(scale(x)) >= 2
#index with the function to remove those values
   dat[!apply(sapply(dat, outdet), 1, any), ]
Kole Stewart asked Feb 24 '15 03:02


1 Answer

Here's a method using base R:

element <- sample(letters[1:5], 1e4, replace = TRUE)
value <- rnorm(1e4)
df <- data.frame(element, value)

means.without.ols <- tapply(value, element, function(x) {
  mean(x[!(abs(x - median(x)) > 2*sd(x))])
})
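The tapply() call above returns only the group means. If you also want a new data frame with the outliers excluded (point 2 of the question), one possible base R sketch uses ave() to get each row's group median and group sd. The toy data and the names is_outlier, df_clean and group_means are made up for illustration and are not from the original post:

```r
# Hypothetical toy data: two groups, each with one obvious outlier.
df <- data.frame(
  element = rep(c("Mg", "Sr"), each = 6),
  value   = c(10, 11, 9, 10, 12, 100, 5, 6, 5, 4, 6, -50)
)

# ave() returns a vector aligned with the rows of df, holding the
# statistic of the group each row belongs to.
grp_median <- ave(df$value, df$element, FUN = median)
grp_sd     <- ave(df$value, df$element, FUN = sd)

# Flag rows more than 2 sd away from their group median.
is_outlier <- abs(df$value - grp_median) > 2 * grp_sd

# New data frame with the flagged rows excluded.
df_clean <- df[!is_outlier, ]

# Group means computed on the cleaned data.
group_means <- tapply(df_clean$value, df_clean$element, mean)
```

Keeping is_outlier as a separate logical vector also lets you inspect the excluded rows with df[is_outlier, ] before dropping them.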

And using dplyr:

df1 = df %>%
  group_by(element) %>%
  filter(!(abs(value - median(value)) > 2*sd(value))) %>%
  summarise_each(funs(mean), value)

Comparison of results:

> means.without.ols
           a            b            c            d            e 
-0.008059215 -0.035448381 -0.013836321 -0.013537466  0.021170663 

> df1
Source: local data frame [5 x 2]

  element        value
1       a -0.008059215
2       b -0.035448381
3       c -0.013836321
4       d -0.013537466
5       e  0.021170663
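Note that summarise_each() and funs() have since been deprecated in dplyr. A minimal sketch of the same pipeline with summarise(), assuming a recent dplyr release, might look like the following. The toy data and the intermediate df_clean are illustrative additions, not part of the original answer:

```r
library(dplyr)

# Hypothetical toy data: two groups, each with one obvious outlier.
df <- data.frame(
  element = rep(c("Mg", "Sr"), each = 6),
  value   = c(10, 11, 9, 10, 12, 100, 5, 6, 5, 4, 6, -50)
)

# Keep rows within 2 sd of their group median. On a grouped data frame,
# median() and sd() inside filter() are evaluated per group.
df_clean <- df %>%
  group_by(element) %>%
  filter(!(abs(value - median(value)) > 2 * sd(value))) %>%
  ungroup()

# Group means after the outliers have been removed.
df1 <- df_clean %>%
  group_by(element) %>%
  summarise(value = mean(value))
```

Splitting the pipeline in two keeps the outlier-free data frame (df_clean) available for other summaries, not just the means.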
Zelazny7 answered Oct 07 '22 18:10