
Assigning categorical values to NAs randomly or proportionally

Tags: r, na

I have a dataset:

df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
"male"), Division = c("South Atlantic", "East North Central", 
"Pacific", "East North Central", "South Atlantic", "South Atlantic", 
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538, 
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn", 
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

The analysis I need to perform cannot handle NA values in the gender variable. There are too few other columns, and none has known predictive value, so model-based imputation isn't really possible.

I can perform the analysis by removing the incomplete observations entirely (they are about 4% of the dataset), but I'd like to see the results after randomly assigning female or male to the missing cases.

Other than writing some pretty ugly code to filter to just the incomplete cases, split them in two, and replace the NAs with female or male in each half, is there an elegant way to assign values to NAs randomly or in proportion to the observed values?

nycrefugee asked Feb 23 '19 20:02


2 Answers

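We can use ifelse and is.na to find the missing values, and then use sample to draw female or male at random. sample has to draw one value per row (with replace = TRUE); a single draw would be recycled, so every NA would receive the same value.

df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)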
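
If the missing values should follow the observed female/male split rather than a 50/50 draw, something along these lines might work. This is only a minimal sketch: the toy `gender` vector copies the question's column, the `miss` and `shares` names are made up for illustration, and `set.seed` is there purely to make the example reproducible.

```r
# Toy vector mirroring the gender column from the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # only so the sketch is reproducible

miss <- is.na(gender)

# Observed shares among the non-missing rows; table() drops NA by default
shares <- prop.table(table(gender))

# One independent draw per missing row, weighted by the observed shares
gender[miss] <- sample(names(shares), size = sum(miss), replace = TRUE,
                       prob = as.numeric(shares))
```

Only the NA positions are overwritten, so the observed values stay exactly as they were.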
www answered Dec 17 '22 13:12


How about this:

> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
+                                 "male"),
+                      Division = c("South Atlantic", "East North Central", 
+                                   "Pacific", "East North Central", "South Atlantic", "South Atlantic", 
+                                   "Pacific"),
+                      Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+                                 107683.9118, 56149.3217, 46237.265),
+                      first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+                 row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
> 
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
> 
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
> 
> df$gender
[1] "female" "male"   "female" "female" "male"   "male"   "male"  
> 

That is random with a given probability (here each missing value becomes "male" with probability 0.52). You could also consider imputing values using nearest neighbors, hot deck, or similar.
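
For what it's worth, a very simple hot-deck style fill could look roughly like the sketch below: each missing gender is copied from a randomly chosen observed row in the same Division. The helper name `hot_deck` is hypothetical (it is not from any package), and grouping by Division is just an assumption for illustration, since the question says the other columns have no known predictive value.

```r
# Cut-down copy of the question's data (only the two columns used here)
df <- data.frame(
  gender   = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

set.seed(1)  # only so the sketch is reproducible

# Hypothetical helper: fill NAs in x with values drawn from the observed
# ("donor") entries of x; groups without any donor are left as NA
hot_deck <- function(x) {
  donors <- x[!is.na(x)]
  miss <- is.na(x)
  if (length(donors) > 0 && any(miss)) {
    x[miss] <- donors[sample.int(length(donors), sum(miss), replace = TRUE)]
  }
  x
}

# ave() applies the helper within each Division and keeps the row order
df$gender <- ave(df$gender, df$Division, FUN = hot_deck)
```

On real data you would want to check how many groups have no donors at all, because those rows stay NA with this approach.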

Hope it helps.

Santiago Capobianco answered Dec 17 '22 13:12