Assigning categorical values to NAs randomly or proportionally

Question

I have a dataset:

df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
"male"), Division = c("South Atlantic", "East North Central", 
"Pacific", "East North Central", "South Atlantic", "South Atlantic", 
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538, 
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn", 
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

I need to perform an analysis that can't have NA values in the gender variable. There are too few other columns, and they have no known predictive value, so imputing the values isn't really possible.

I can perform the analysis by removing the incomplete observations entirely - they are about 4% of the dataset, but I'd like to see the results by randomly assigning female or male into the missing cases.

Other than writing some pretty ugly code to filter to just incomplete cases, split in two and replace NAs with female or male in each half, I wondered if there was an elegant way to randomly or proportionally assign values into NAs?

www · Accepted Answer

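We can use ifelse and is.na to find the missing values, and then use sample to randomly select "female" or "male" for them. Note that sample has to produce one draw per row (size = nrow(df), replace = TRUE); with a single draw, ifelse recycles that one value and every NA ends up with the same gender.

df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)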
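The question also asks about assigning values proportionally. A minimal sketch of one way to read that, assuming "proportionally" means drawing each missing value with probability equal to that category's share among the observed rows. The vector gender and the names miss and obs_prop are illustrative, and set.seed is there only to make the sketch reproducible.

```r
# Toy vector mirroring the gender column from the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # only so the sketch is reproducible

# Positions of the missing values, fixed before anything is filled in
miss <- is.na(gender)

# Observed shares among the non-missing values (1 female, 4 male here)
obs_prop <- prop.table(table(gender[!miss]))

# One independent draw per missing value, weighted by the observed shares
gender[miss] <- sample(names(obs_prop), size = sum(miss),
                       replace = TRUE, prob = as.numeric(obs_prop))

gender
```

Because the draws are random, the filled-in values change from run to run unless a seed is set; the non-missing values are never touched.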

Santiago Capobianco · Answer

How about this:

> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
+                                 "male"),
+                      Division = c("South Atlantic", "East North Central", 
+                                   "Pacific", "East North Central", "South Atlantic", "South Atlantic", 
+                                   "Pacific"),
+                      Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+                                 107683.9118, 56149.3217, 46237.265),
+                      first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+                 row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
> 
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
> 
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
> 
> df$gender
[1] "female" "male"   "female" "female" "male"   "male"   "male"  
>

That is random with a given probability (here each draw is "male" with probability 0.52). You could also consider imputing values using nearest neighbors, hot deck, or similar methods.
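For reference, a random hot deck can be sketched in base R as follows. This is only an illustration under assumptions: hot_deck_fill is a hypothetical helper, not a function from any package, and using Division as the donor group is an arbitrary choice, since the question says the other columns have no known predictive value. Each missing value is copied from a randomly chosen observed "donor" in the same group, falling back to all observed values when the group has no donor.

```r
# Reduced copy of the question's data (only the columns used here)
df <- data.frame(
  gender   = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so the sketch is reproducible

# Hypothetical helper: random hot deck within groups
hot_deck_fill <- function(x, group) {
  observed <- !is.na(x)
  stopifnot(any(observed))  # at least one donor is required
  out <- x
  for (i in which(!observed)) {
    # Donors: originally observed values from the same group
    donors <- x[observed & group == group[i]]
    # Fallback: no donor in this group, so use every observed value
    if (length(donors) == 0) donors <- x[observed]
    # Index-based draw avoids sample()'s special case for length-1 input
    out[i] <- donors[sample.int(length(donors), 1)]
  }
  out
}

df$gender <- hot_deck_fill(df$gender, df$Division)
df$gender
```

In this small example each missing row has exactly one donor in its Division, and both donors are "male", so both NAs become "male". On the full dataset the draws would vary between runs.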

Hope it helps.

Tags:

r

na

nycrefugee