I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values in it.
I would like to replace each NA value with the mean of its group. That is, missing observations from group A have to be replaced with the mean of group A.
I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops.
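# Handle group A on its own
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)  # the argument is na.rm, not rm.na
# Overwrite the missing values in A with A's mean
A$variable[is.na(A$variable)] <- mean(A$variable, na.rm = TRUE)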
Now, I understand I could do the same for groups B and C, but perhaps a for loop (with if and else) might do the trick?
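For what it's worth, the loop described in the question could look roughly like the sketch below. The toy data frame is made up for illustration; only the column names group and variable come from the question. Logical indexing picks out the rows to fill, so no if/else is needed.

```r
# Hypothetical toy data; column names follow the question.
data <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C", "C"),
  variable = c(1, NA, 3, 10, NA, 5, 7, NA)
)

# Loop over the group labels and fill each group's NAs with that group's mean.
for (g in unique(data$group)) {
  in_group   <- data$group == g
  group_mean <- mean(data$variable[in_group], na.rm = TRUE)
  data$variable[in_group & is.na(data$variable)] <- group_mean
}

data$variable
```

This works, but it is mostly useful for seeing what the vectorised answers do in one step.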
Problem #1: Mean imputation does not preserve the relationships among variables. True, imputing the mean preserves the mean of the observed data. So if the data are missing completely at random, the estimate of the mean remains unbiased.
Mean imputation (MI) is one such method: the mean of the observed values for each variable is computed, and the missing values for that variable are imputed with this mean. This method can lead to severely biased estimates even if the data are MCAR (see, e.g., Jamshidian and Bentler, 1999).
Mean imputation is typically considered terrible practice since it ignores feature correlation.
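A small made-up example may help show what "does not preserve the relationships among variables" means in practice. The vectors x and y and the positions of the missing values are invented for illustration, not taken from any real data set. The sketch compares the variance of y and its correlation with x before and after filling the gaps with the mean.

```r
# Hypothetical data: y is a perfect linear function of x.
x <- c(2, 4, 6, 8, 10, 12)
y <- c(1, 3, 5, 7, 9, 11)

# Knock out two values by hand, then fill them with the observed mean.
y_missing <- y
y_missing[c(1, 6)] <- NA
y_imputed <- ifelse(is.na(y_missing), mean(y_missing, na.rm = TRUE), y_missing)

# Variance of the observed values versus variance after imputation.
var_observed <- var(y_missing, na.rm = TRUE)
var_imputed  <- var(y_imputed)

# Correlation with x on complete cases versus after imputation.
cor_observed <- cor(x, y_missing, use = "complete.obs")
cor_imputed  <- cor(x, y_imputed)

c(var_observed = var_observed, var_imputed = var_imputed,
  cor_observed = cor_observed, cor_imputed = cor_imputed)
```

In this toy case the mean of y is untouched by the imputation, while both the variance and the correlation with x come out smaller than they were on the observed values.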
library(dplyr)
# Replace each NA with the mean of its own group and keep the result
data <- data %>%
  group_by(group) %>%
  mutate(variable = ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable)) %>%
  ungroup()
For a faster, base-R version, you can use ave:
data$variable <- ave(data$variable, data$group,
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
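To see what the ave call does, here is a minimal sketch on a made-up data frame. The values and the extra imputed column are assumptions for illustration; only the column names group and variable come from the question. The result is written to a new column so the original values stay visible next to the filled ones.

```r
# Hypothetical toy data; column names follow the question.
data <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C", "C"),
  variable = c(1, NA, 3, 10, NA, 5, 7, NA)
)

# ave() applies FUN within each level of data$group and returns
# a vector aligned with the original rows.
data$imputed <- ave(data$variable, data$group,
                    FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

data
```

One thing to watch: if a group has no observed values at all, mean(x, na.rm = TRUE) returns NaN, so those rows end up as NaN instead of a number.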