I have a dataset:
df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
"male"), Division = c("South Atlantic", "East North Central",
"Pacific", "East North Central", "South Atlantic", "South Atlantic",
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538,
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn",
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I need to perform an analysis such that I can't have NA values in the gender variable. The other columns are too few and have no known predictive value so that imputing the values isn't really possible.
I can perform the analysis by removing the incomplete observations entirely - they are about 4% of the dataset, but I'd like to see the results by randomly assigning female or male into the missing cases.
Other than writing some pretty ugly code to filter to just incomplete cases, split in two and replace NAs with female or male in each half, I wondered if there was an elegant way to randomly or proportionally assign values into NAs?
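We can use ifelse and is.na to find the NA entries, and then use sample to randomly pick female or male for each of them. Note that sample has to draw one value per row (with replace = TRUE); drawing a single value would assign the same gender to every NA.
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)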
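If the assignment should be proportional rather than 50/50, one option is to weight the draw by the shares observed among the non-missing values. The following is a minimal sketch, not a definitive method: it uses a plain character vector copied from the question's gender column, and the names obs_share and missing, as well as the fixed seed, are illustrative assumptions.

```r
set.seed(42)  # assumption: fixed seed only so the example is reproducible

# Same values as the gender column in the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

# Observed shares among the non-missing values (table() drops NA by default)
obs_share <- prop.table(table(gender))

# Draw one value per missing entry, weighted by the observed shares
missing <- is.na(gender)
gender[missing] <- sample(names(obs_share), sum(missing),
                          replace = TRUE, prob = as.numeric(obs_share))
```

Only the NA positions are overwritten; the observed values stay as they were. With so few observed rows the estimated shares are very rough, so on the real dataset it is probably worth repeating the draw a few times to see how sensitive the results are.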
How about this:
> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
+ "male"),
+ Division = c("South Atlantic", "East North Central",
+ "Pacific", "East North Central", "South Atlantic", "South Atlantic",
+ "Pacific"),
+ Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+ 107683.9118, 56149.3217, 46237.265),
+ first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+ row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
>
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
>
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
>
> df$gender
[1] "female" "male" "female" "female" "male" "male" "male"
>
That is random with a given probability (here 0.52 for male). You could also consider imputing values using nearest neighbors, hot deck, or similar.
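For reference, a random hot deck could look roughly like the sketch below: each missing value is copied from a randomly chosen observed "donor" row in the same group, falling back to all observed rows when the group has no donor. This is only an illustration under assumptions: hot_deck is a hypothetical helper (not from any package), and Division is used as the grouping variable purely for demonstration, since the question says the other columns have no known predictive value.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible

df <- data.frame(
  gender   = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: random hot deck within groups
hot_deck <- function(x, group) {
  observed <- !is.na(x)
  out <- x
  for (i in which(!observed)) {
    # Donors are observed values from the same group as row i
    donors <- x[observed & group == group[i]]
    # Fall back to every observed value if the group has no donor
    if (length(donors) == 0) donors <- x[observed]
    out[i] <- donors[sample.int(length(donors), 1)]
  }
  out
}

df$gender <- hot_deck(df$gender, df$Division)
```

In this tiny sample each missing row has exactly one donor in its Division, and both donors are male, so the result is not really random here; on a larger dataset each group would typically offer several donors.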
Hope it helps.