
Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C).

I also have a continuous variable with some missing values on it.

I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A.

I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops.

# Subset group A, inspect its mean, then fill its NAs with that mean
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)
A$variable[is.na(A$variable)] <- mean(A$variable, na.rm = TRUE)

Now, I understand I could do the same for group B and C, but perhaps a for loop (with if and else) might do the trick?

Asked by Jonatan Ottino, Mar 25 '19 20:03




1 Answer

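library(dplyr)
# Replace each NA with the mean of its own group; assign the result back to keep it
data <- data %>%
  group_by(group) %>%
  mutate(variable = ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable)) %>%
  ungroup()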

For a faster, base-R version, you can use ave:

data$variable <- ave(data$variable, data$group, FUN = function(x)
  ifelse(is.na(x), mean(x, na.rm = TRUE), x))
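
To see what the ave call does, and how it compares with the for loop the question asks about, here is a minimal base-R sketch. The toy data frame is made up for illustration; only the column names group and variable come from the question.

```r
# Hypothetical toy data: groups B and C each contain one NA
data <- data.frame(
  group    = c("A", "A", "B", "B", "B", "C", "C"),
  variable = c(1, 3, 10, NA, 20, NA, 7)
)

# Group-wise mean imputation with ave(), as in the answer above
imputed_ave <- ave(data$variable, data$group, FUN = function(x)
  ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# The same result with an explicit loop over the group levels
imputed_loop <- data$variable
for (g in unique(data$group)) {
  in_group   <- data$group == g
  group_mean <- mean(data$variable[in_group], na.rm = TRUE)
  imputed_loop[in_group & is.na(imputed_loop)] <- group_mean
}

print(imputed_ave)   # 1 3 10 15 20 7 7
print(isTRUE(all.equal(imputed_ave, imputed_loop)))
```

Both versions give the same vector, so the loop is not needed. Note that if a group has no observed values at all, its mean is NaN and the missing entries in that group stay unfilled.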
Answered by iod, Sep 19 '22 21:09
