I have a dataset:
df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
"male"), Division = c("South Atlantic", "East North Central",
"Pacific", "East North Central", "South Atlantic", "South Atlantic",
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538,
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn",
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I need to perform an analysis that can't have NA values in the gender variable. The other columns are too few and have no known predictive value, so imputing the values isn't really possible.
I can perform the analysis by removing the incomplete observations entirely (they are about 4% of the dataset), but I'd like to see the results after randomly assigning female or male to the missing cases.
Other than writing some pretty ugly code to filter to just the incomplete cases, split them in two and replace the NAs with female or male in each half, I wondered if there was an elegant way to randomly or proportionally assign values to the NAs?
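We can use ifelse and is.na to find the NA values, and then use sample to randomly pick female or male for each row. Note that sample needs one draw per row (with replace = TRUE); a single draw would be recycled by ifelse, so every NA would receive the same gender.
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)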
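For the proportional variant the question asks about, one option is to draw the replacements from the observed non-missing values, so each missing row gets its own draw and the fill follows the observed female/male split in expectation. This is only a sketch: the small df below is a stand-in for the real data, and set.seed() is an assumption added purely for reproducibility.

```r
# Stand-in data with the same gender column as in the question
df <- data.frame(
  gender = c("female", "male", NA, NA, "male", "male", "male"),
  stringsAsFactors = FALSE
)

set.seed(42)  # assumption: fixed seed so the assignment can be reproduced

miss <- is.na(df$gender)

# One independent draw per missing row, taken from the observed values,
# so the expected proportions match the non-missing data
df$gender[miss] <- sample(df$gender[!miss], sum(miss), replace = TRUE)

table(df$gender)
```

Because the assignment is random, rerunning it with a few different seeds gives a rough sense of how sensitive the downstream analysis is to the filled-in values.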
How about this:
> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male",
+ "male"),
+ Division = c("South Atlantic", "East North Central",
+ "Pacific", "East North Central", "South Atlantic", "South Atlantic",
+ "Pacific"),
+ Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+ 107683.9118, 56149.3217, 46237.265),
+ first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+ row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
>
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = 0:1, labels = c("female", "male"))
>
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
>
> df$gender
[1] "female" "male" "female" "female" "male" "male" "male"
>
This is random with a given probability (here 0.52 for male). You could also consider imputing values using nearest neighbors, hot deck, or similar.
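A hot-deck fill can be sketched in base R as follows. The helper hot_deck() is hypothetical (not from any package), and using Division as the donor group is only an assumption for illustration; the question notes that the other columns have no known predictive value, in which case the grouping adds little over plain random sampling.

```r
# Stand-in data: the gender and Division columns from the question
df <- data.frame(
  gender   = c("female", "male", NA, NA, "male", "male", "male"),
  Division = c("South Atlantic", "East North Central", "Pacific",
               "East North Central", "South Atlantic", "South Atlantic",
               "Pacific"),
  stringsAsFactors = FALSE
)

set.seed(1)  # assumption: fixed seed for reproducibility

# Hypothetical helper: fill each NA with a randomly chosen observed value
# ("donor") from the same group, falling back to all observed values when
# the group has no donor. Assumes at least one observed value exists.
hot_deck <- function(x, group) {
  observed <- !is.na(x)
  filled <- x
  for (i in which(!observed)) {
    donors <- x[observed & group %in% group[i]]
    if (length(donors) == 0) donors <- x[observed]
    # Index with sample.int so a single donor is handled correctly
    filled[i] <- donors[sample.int(length(donors), 1)]
  }
  filled
}

df$gender <- hot_deck(df$gender, df$Division)
```

Donors are always taken from the originally observed values, so an imputed row never becomes a donor for another row.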
Hope it helps.