I would like to add random NAs to a data.frame in R. So far I've looked into these questions:
R: Randomly insert NAs into dataframe proportionaly
How do I add random NAs into a data frame
add random missing values to a complete data frame (in R)
Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:
Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with points N°3 and N°4. Thanks.
[note] The exact proportion, to within +/- 1 NA of course.
This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. This inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact number, because n * p * pctNA is not always a whole number, i.e. (n * p * pctNA) %% 1 != 0, in which case floor() rounds the count down.
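createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # One logical flag per cell, TRUE where an NA will be inserted
  NAloc <- rep(FALSE, n * p)
  # Draw floor(n * p * pctNA) distinct cell positions at random
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # Reshape the flags to the dimensions of x and set the flagged cells to NA
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}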
Obviously you should set a random seed for reproducibility, which can be done with set.seed() before the function call.
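As a small illustration of the rounding and of the seed, here is a sketch that calls sample.int() directly, the same way the function does. The dimensions (n = 7, p = 3) and the seed value 42 are made up for the example.

```r
# Hypothetical dimensions, chosen so that n * p * pctNA is not a whole number
n <- 7
p <- 3
pctNA <- 0.1

k <- floor(n * p * pctNA)   # 2.1 is rounded down to 2 cells
achieved <- k / (n * p)     # 2 / 21, slightly below the requested 0.1

# Setting the same seed before each draw gives the same cell positions
set.seed(42)
first <- sample.int(n * p, k)
set.seed(42)
second <- sample.int(n * p, k)
same_draw <- identical(first, second)  # TRUE
```

So the achieved proportion can fall short of pctNA by less than one cell, and two runs with the same seed put the NAs in the same places.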
This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.
Edit: I do assume that x is complete. So, I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
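To see what happens when x is not complete, here is a small experiment. It is only a sketch: the function is copied from above, and the 10 x 10 data frame with 20 pre-existing NAs is made up for the example. Because positions are drawn from all cells, some of them can land on cells that are already NA.

```r
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}

set.seed(1)
# Made-up 10 x 10 data frame whose first two columns are already NA
df <- as.data.frame(matrix(rnorm(100), nrow = 10))
df[, 1:2] <- NA
before <- mean(is.na(df))   # 20 of 100 cells are NA

# 30 distinct positions are drawn from all 100 cells, so some may hit
# cells that are already NA
after <- mean(is.na(createNAs(df, pctNA = 0.3)))
```

Here the final proportion is at least 0.3 (the 30 drawn cells are all NA afterwards) and at most 0.5 (reached only if none of them overlaps an existing NA), so the function neither adds 30% on top of the existing NAs nor is guaranteed to stop at 30% overall.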
Some users reported that Alex's answer did not address condition N°5 of my question. Indeed, when adding random NAs to a data frame that already contains missing values, some of the new ones will fall on the existing ones, and the final proportion will end up somewhere between the initial proportion and the desired proportion. So I expanded on Alex's function to comply with all 5 conditions.
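I modified his createNAs function so that it supports one of 3 options, selected through the option argument ("none", "add" or "complement" in the code below).
For the "add" and "complement" options, the function calls itself repeatedly until the desired proportion of NA is reached: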
createNAs <- function (x, pctNA = 0.0, option = "add") {
  # Proportion of cells in x that are currently NA
  prop.NA <- function(x) sum(is.na(x)) / prod(dim(x))
  initial.pctNA <- prop.NA(x)
  if ((option == "complement") && (initial.pctNA > pctNA)) {
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }
  if ((option == "none") || (initial.pctNA == 0)) {
    # Place NAs at random over all cells, ignoring any NA already present
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else { # option is "add" or "complement"
    target <- if (option == "complement") pctNA else pctNA + initial.pctNA
    # Stop once less than one whole cell remains to be added; otherwise
    # floor() in the "none" branch adds nothing and the loop never ends
    while (floor((target - prop.NA(x)) * prod(dim(x))) >= 1) {
      prop.remaining.to.add <- target - prop.NA(x)
      x <- createNAs(x, prop.remaining.to.add, option = "none")
    }
    return(x)
  }
}
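The function above draws positions from all cells and then tops up whatever overlapped. A possible alternative, sketched here as an assumption rather than as part of either answer, is to sample only among the cells that are not NA yet, so the target count is reached in one pass. The name addNAsExact and its rounding rule, floor(n * p * pctNA), are hypothetical.

```r
# Hypothetical helper (not part of either answer): add NAs only to cells
# that are not NA yet, so no draw is wasted on an existing NA.
addNAsExact <- function(x, pctNA = 0.1, option = c("add", "complement")) {
  option <- match.arg(option)
  n.cells <- nrow(x) * ncol(x)
  is.missing <- as.matrix(is.na(x))
  n.existing <- sum(is.missing)

  # Number of NAs to add: a fixed share of all cells ("add"), or only
  # what is lacking to reach that share ("complement")
  n.new <- floor(n.cells * pctNA)
  if (option == "complement") n.new <- max(n.new - n.existing, 0)

  available <- which(!is.missing)         # linear indices of non-NA cells
  n.new <- min(n.new, length(available))  # cannot add more than what is left

  # Index through sample.int() so a single available cell is handled correctly
  chosen <- available[sample.int(length(available), n.new)]
  is.new <- matrix(FALSE, nrow = nrow(x), ncol = ncol(x))
  is.new[chosen] <- TRUE
  x[is.new] <- NA
  x
}

set.seed(123)
# Made-up 10 x 10 data frame with 5 NAs already in place
df <- as.data.frame(matrix(rnorm(100), nrow = 10))
df[1:5, 1] <- NA

added  <- addNAsExact(df, pctNA = 0.2, option = "add")         # 5 + 20 NAs
topped <- addNAsExact(df, pctNA = 0.2, option = "complement")  # 20 NAs in total
```

The trade-off is one extra pass over the data to locate the non-NA cells, in exchange for not needing a loop.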