Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional calculation of mean

Tags:

r

dplyr

I would like to calculate means of a data frame for each factor, but only if certain condition is met. For example, I have this data frame:

> df <- data.frame(name = rep(c("A", "B", "C"), 20), 
                 variable1 = rep(c(1, 1, 1, 1, 1, NA), 10), 
                 variable2 = rep(c(2, NA, 2, 2, 2, 2), 10), 
                 variable3 = rep(c(NA, 3, 3, 3, 3, 3), 10))

> table(df$name, df$variable1)

     1
  A 20
  B 20
  C 10

> table(df$name, df$variable2)

     2
  A 20
  B 10
  C 20

> table(df$name, df$variable3)

     3
  A 10
  B 20
  C 20

I can easily calculate means for each variable with dplyr in a following way:

means <- df %>%
  group_by(name) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))
> means
Source: local data frame [3 x 4]

  name variable1 variable2 variable3
1    A         1         2         3
2    B         1         2         3
3    C         1         2         3

However, what I want is to only calculate means if there are 11 or more observations, otherwise put NA's in respective cells. In other words, the result should look like this:

  name variable1 variable2 variable3
1    A         1         2        NA
2    B         1        NA         3
3    C        NA         2         3
like image 446
Mikhail Balyasin Avatar asked Aug 21 '15 08:08

Mikhail Balyasin


2 Answers

We could create a function (f1) which will do the mean if the number of non-NA values is greater than 11 (sum(!is.na(x))>11) or else we get NA. I am using NA_real_ as the default NA would be of logical class and there may be clash of class in some packages.

Using dplyr, we group by name and use the function (f1) within summarise_each

f1 <- function(x) if(sum(!is.na(x))>11) mean (x, na.rm=TRUE) else NA_real_
library(dplyr) 
df  %>% 
     group_by(name) %>%
     summarise_each(funs(f1))

Or a similar approach using data.table would be to convert the data.frame to data.table (setDT(df)). We loop through the columns (lapply(.SD, ..)), use the same function grouped by name

library(data.table)
setDT(df)[,  lapply(.SD, f1), by = name]
#   name variable1 variable2 variable3
#1:    A         1         2        NA
#2:    B         1        NA         3
#3:    C        NA         2         3
like image 102
akrun Avatar answered Oct 22 '22 17:10

akrun


An alternative transforming previously the data from wide to long.

library(reshape2)
df1 <- melt(df, id.vars = c("name")) # From wide to long
df1 <- df1 %>% group_by(name, variable) %>% 
       filter(n()>10) %>%
       summarize(mean = mean(value))

Long format output:

   name  variable mean
1    A variable1    1
2    A variable2    2
3    A variable3   NA
4    B variable1    1
5    B variable2   NA
6    B variable3    3
7    C variable1   NA
8    C variable2    2
9    C variable3    3

Wide format output:

dcast(df1,  name ~ variable, value.var = "mean")

  name variable1 variable2 variable3
1    A         1         2        NA
2    B         1        NA         3
3    C        NA         2         3
like image 2
mpalanco Avatar answered Oct 22 '22 18:10

mpalanco