I would like to calculate means of a data frame for each factor, but only if certain condition is met. For example, I have this data frame:
> df <- data.frame(name = rep(c("A", "B", "C"), 20),
variable1 = rep(c(1, 1, 1, 1, 1, NA), 10),
variable2 = rep(c(2, NA, 2, 2, 2, 2), 10),
variable3 = rep(c(NA, 3, 3, 3, 3, 3), 10))
> table(df$name, df$variable1)
1
A 20
B 20
C 10
> table(df$name, df$variable2)
2
A 20
B 10
C 20
> table(df$name, df$variable3)
3
A 10
B 20
C 20
I can easily calculate means for each variable with dplyr
in a following way:
means <- df %>%
group_by(name) %>%
summarise_each(funs(mean(., na.rm = TRUE)))
> means
Source: local data frame [3 x 4]
name variable1 variable2 variable3
1 A 1 2 3
2 B 1 2 3
3 C 1 2 3
However, what I want is to only calculate means if there are 11 or more observations, otherwise put NA
's in respective cells. In other words, the result should look like this:
name variable1 variable2 variable3
1 A 1 2 NA
2 B 1 NA 3
3 C NA 2 3
We could create a function (f1
) which will do the mean if
the number of non-NA
values is greater than 11 (sum(!is.na(x))>11
) or else we get NA
. I am using NA_real_
as the default NA
would be of logical class and there may be clash of class in some packages.
Using dplyr
, we group by name
and use the function (f1
) within summarise_each
f1 <- function(x) if(sum(!is.na(x))>11) mean (x, na.rm=TRUE) else NA_real_
library(dplyr)
df %>%
group_by(name) %>%
summarise_each(funs(f1))
Or a similar approach using data.table
would be to convert the data.frame
to data.table
(setDT(df)
). We loop through the columns (lapply(.SD, ..)
), use the same function grouped by name
library(data.table)
setDT(df)[, lapply(.SD, f1), by = name]
# name variable1 variable2 variable3
#1: A 1 2 NA
#2: B 1 NA 3
#3: C NA 2 3
An alternative transforming previously the data from wide to long.
library(reshape2)
df1 <- melt(df, id.vars = c("name")) # From wide to long
df1 <- df1 %>% group_by(name, variable) %>%
filter(n()>10) %>%
summarize(mean = mean(value))
Long format output:
name variable mean
1 A variable1 1
2 A variable2 2
3 A variable3 NA
4 B variable1 1
5 B variable2 NA
6 B variable3 3
7 C variable1 NA
8 C variable2 2
9 C variable3 3
Wide format output:
dcast(df1, name ~ variable, value.var = "mean")
name variable1 variable2 variable3
1 A 1 2 NA
2 B 1 NA 3
3 C NA 2 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With