I have data where each row contains a person's gender and weight (in lbs):
genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)
df <- data.frame(gender=genders, weight=weights)
df
# gender weight
# 1 FEMALE 110
# 2 FEMALE 120
# 3 FEMALE 112
# 4 FEMALE NA
# 5 FEMALE NA
# 6 MALE 190
# 7 MALE 202
# 8 MALE 195
# 9 MALE NA
For every row that has an NA in the weight
column, I would like to replace / impute the NA with the weight
mean, but the mean should be calculated using only the rows that match the same gender
value as the row with the NA.
Specifically, rows 4 and 5 have a gender
of FEMALE and weight
of NA. I want to replace the NA with the mean weight
computed over the subset of rows that match the gender
of FEMALE. In this case, the mean would be (110+120+112)/3=114.0 from the other rows 1, 2, and 3.
Likewise, I want to replace the NA in row 9 with the mean of the weights for MALE gender
.
I tried the following command, but it replaced the NA with the mean weight over all users across both genders, which is not what I want.
df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T)
df
# gender weight
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 154.8333
# 5 FEMALE 154.8333
# 6 MALE 190.0000
# 7 MALE 202.0000
# 8 MALE 195.0000
# 9 MALE 154.8333
I searched other questions, but they are not quite the same problem as mine:
"Replace NA with mean matching the same ID"
"How to replace NA with mean by subset in R (impute with plyr?)"
"How to replace NA values in a table for selected columns? data.frame, data.table"
You could use ave()
with replace()
(or standard manual replacement).
df$weight <- with(df, ave(weight, gender,
FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))
which gives
gender weight 1 FEMALE 110.0000 2 FEMALE 120.0000 3 FEMALE 112.0000 4 FEMALE 114.0000 5 FEMALE 114.0000 6 MALE 190.0000 7 MALE 202.0000 8 MALE 195.0000 9 MALE 195.6667
You can group your data frame by gender
and then calculate the average of weight and replace the NA
with ifelse
statement, in dplyr
, it could be:
library(dplyr)
df %>%
group_by(gender) %>%
mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = T), weight))
# Source: local data frame [9 x 2]
# Groups: gender [2]
# gender weight
# <fctr> <dbl>
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 114.0000
# 5 FEMALE 114.0000
# 6 MALE 190.0000
# 7 MALE 202.0000
# 8 MALE 195.0000
# 9 MALE 195.6667
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With