Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace NA with mean based on row subset matching another column?

Tags:

dataframe

r

I have data where each row contains a person's gender and weight (in lbs):

genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)

df <- data.frame(gender=genders, weight=weights)
df
#   gender weight
# 1 FEMALE    110
# 2 FEMALE    120
# 3 FEMALE    112
# 4 FEMALE     NA
# 5 FEMALE     NA
# 6   MALE    190
# 7   MALE    202
# 8   MALE    195
# 9   MALE     NA

For every row that has an NA in the weight column, I would like to replace / impute the NA with the weight mean, but the mean should be calculated using only the rows that match the same gender value as the row with the NA.

Specifically, rows 4 and 5 have a gender of FEMALE and weight of NA. I want to replace the NA with the mean weight computed over the subset of rows that match the gender of FEMALE. In this case, the mean would be (110+120+112)/3=114.0 from the other rows 1, 2, and 3.

Likewise, I want to replace the NA in row 9 with the mean of the weights for MALE gender.

I tried the following command, but it replaced the NA with the mean weight over all users across both genders, which is not what I want.

df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T)
df
#   gender   weight
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 154.8333
# 5 FEMALE 154.8333
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 154.8333

I searched other questions, but they are not quite the same problem as mine:

"Replace NA with mean matching the same ID"

"How to replace NA with mean by subset in R (impute with plyr?)"

"How to replace NA values in a table for selected columns? data.frame, data.table"

like image 304
stackoverflowuser2010 Avatar asked Dec 05 '22 17:12

stackoverflowuser2010


2 Answers

You could use ave() with replace() (or standard manual replacement).

df$weight <- with(df, ave(weight, gender,
    FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

which gives

  gender   weight
1 FEMALE 110.0000
2 FEMALE 120.0000
3 FEMALE 112.0000
4 FEMALE 114.0000
5 FEMALE 114.0000
6   MALE 190.0000
7   MALE 202.0000
8   MALE 195.0000
9   MALE 195.6667
like image 52
Rich Scriven Avatar answered Jan 17 '23 15:01

Rich Scriven


You can group your data frame by gender and then calculate the average of weight and replace the NA with ifelse statement, in dplyr, it could be:

library(dplyr)
df %>% 
      group_by(gender) %>% 
      mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = T), weight))

# Source: local data frame [9 x 2]
# Groups: gender [2]

#  gender   weight
#  <fctr>    <dbl>
# 1 FEMALE 110.0000
# 2 FEMALE 120.0000
# 3 FEMALE 112.0000
# 4 FEMALE 114.0000
# 5 FEMALE 114.0000
# 6   MALE 190.0000
# 7   MALE 202.0000
# 8   MALE 195.0000
# 9   MALE 195.6667
like image 38
Psidom Avatar answered Jan 17 '23 16:01

Psidom