Conditional calculation of mean

Question

I would like to calculate means of a data frame for each factor, but only if certain condition is met. For example, I have this data frame:

> df <- data.frame(name = rep(c("A", "B", "C"), 20), 
                 variable1 = rep(c(1, 1, 1, 1, 1, NA), 10), 
                 variable2 = rep(c(2, NA, 2, 2, 2, 2), 10), 
                 variable3 = rep(c(NA, 3, 3, 3, 3, 3), 10))

> table(df$name, df$variable1)

     1
  A 20
  B 20
  C 10

> table(df$name, df$variable2)

     2
  A 20
  B 10
  C 20

> table(df$name, df$variable3)

     3
  A 10
  B 20
  C 20

I can easily calculate means for each variable with dplyr in a following way:

means <- df %>%
  group_by(name) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))
> means
Source: local data frame [3 x 4]

  name variable1 variable2 variable3
1    A         1         2         3
2    B         1         2         3
3    C         1         2         3

However, what I want is to only calculate means if there are 11 or more observations, otherwise put NA's in respective cells. In other words, the result should look like this:

  name variable1 variable2 variable3
1    A         1         2        NA
2    B         1        NA         3
3    C        NA         2         3

akrun · Accepted Answer

We could create a function (f1) which will do the mean if the number of non-NA values is greater than 11 (sum(!is.na(x))>11) or else we get NA. I am using NA_real_ as the default NA would be of logical class and there may be clash of class in some packages.

Using dplyr, we group by name and use the function (f1) within summarise_each

f1 <- function(x) if(sum(!is.na(x))>11) mean (x, na.rm=TRUE) else NA_real_
library(dplyr) 
df  %>% 
     group_by(name) %>%
     summarise_each(funs(f1))

Or a similar approach using data.table would be to convert the data.frame to data.table (setDT(df)). We loop through the columns (lapply(.SD, ..)), use the same function grouped by name

library(data.table)
setDT(df)[,  lapply(.SD, f1), by = name]
#   name variable1 variable2 variable3
#1:    A         1         2        NA
#2:    B         1        NA         3
#3:    C        NA         2         3

mpalanco · Answer

An alternative transforming previously the data from wide to long.

library(reshape2)
df1 <- melt(df, id.vars = c("name")) # From wide to long
df1 <- df1 %>% group_by(name, variable) %>% 
       filter(n()>10) %>%
       summarize(mean = mean(value))

Long format output:

   name  variable mean
1    A variable1    1
2    A variable2    2
3    A variable3   NA
4    B variable1    1
5    B variable2   NA
6    B variable3    3
7    C variable1   NA
8    C variable2    2
9    C variable3    3

Wide format output:

dcast(df1,  name ~ variable, value.var = "mean")

  name variable1 variable2 variable3
1    A         1         2        NA
2    B         1        NA         3
3    C        NA         2         3

Conditional calculation of mean

Tags:

r

dplyr

Mikhail Balyasin

2 Answers

akrun

mpalanco

Recent Activity

Donate For Us

Conditional calculation of mean

Tags:

r

dplyr

Mikhail Balyasin

2 Answers

akrun

mpalanco

Related questions

Recent Activity

Donate For Us