
add exact proportion of random missing values to data.frame

I would like to add random NA to a data.frame in R. So far I've looked into these questions:

R: Randomly insert NAs into dataframe proportionaly

How do I add random NAs into a data frame

add random missing values to a complete data frame (in R)

Many solutions were provided there, but I couldn't find one that complies with these 5 conditions:

  • Add NAs truly at random, not the same number in each row or column.
  • Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts, ...), so the output must have the same format as the input data.frame or matrix.
  • Guarantee an exact number or proportion [note] of NA in the output (many solutions produce fewer NA because several are generated at the same place).
  • Be computationally efficient for big datasets.
  • Add the proportion/number of NA independently of NA already present in the input.

Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first link), but it doesn't comply with conditions 3 and 4. Thanks.

[note] The exact proportion, to within +/- 1 NA of course.

agenis asked Sep 15 '16 14:09


2 Answers

This is how I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion when (n * p * pctNA) %% 1 != 0, because the number of NAs is rounded down to a whole number of cells.

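createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # One logical flag per cell, in column-major order
  NAloc <- rep(FALSE, n * p)
  # Flag floor(n * p * pctNA) distinct cells, drawn without replacement
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # Index with a logical matrix so that column classes are preserved
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}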

For reproducibility, set a random seed with set.seed() before calling the function.
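As a rough illustration of the seeding advice, the sketch below calls the function twice with the same seed and checks that the NA positions match. createNAs is repeated from the answer so the block runs on its own; the data frame df and the seed 42 are made up for the example.

```r
# createNAs() as given in the answer, repeated so the example is self-contained.
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Hypothetical toy data with mixed column classes (50 rows x 4 columns = 200 cells).
df <- data.frame(
  num = seq_len(50) / 10,
  chr = rep(letters[1:5], 10),
  fac = factor(rep(c("u", "v"), 25)),
  lgl = rep(c(TRUE, FALSE), 25),
  stringsAsFactors = FALSE
)

set.seed(42)                       # arbitrary seed
a <- createNAs(df, pctNA = 0.1)
set.seed(42)                       # same seed again
b <- createNAs(df, pctNA = 0.1)

identical(a, b)                    # same seed, same NA positions
sum(is.na(a))                      # floor(200 * 0.1) cells were set to NA
sapply(a, class)                   # column classes are unchanged
```

Because df starts out complete, the number of NAs here is exactly floor(n * p * pctNA).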

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.

Edit: I assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
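One possible modification, sketched here as an assumption rather than as the answer author's code, is to draw the new NA positions only from cells that are currently observed. The function name addNAsToObserved is hypothetical. The extra is.na() scan is the O(n*p) cost mentioned above.

```r
# Hypothetical variant: new NAs are drawn only from non-missing cells,
# so none of them can land on an existing NA.
addNAsToObserved <- function(x, pctNA = 0.1) {
  n.cells <- nrow(x) * ncol(x)
  observed <- which(!is.na(x))     # column-major indices of non-NA cells
  # Never ask for more cells than are still observed
  n.new <- min(floor(n.cells * pctNA), length(observed))
  if (n.new <= 0) return(x)
  NAloc <- rep(FALSE, n.cells)
  # Index into `observed` instead of calling sample(observed, ...), which
  # treats a length-one numeric vector as the range 1:observed.
  NAloc[observed[sample.int(length(observed), n.new)]] <- TRUE
  x[matrix(NAloc, nrow = nrow(x), ncol = ncol(x))] <- NA
  x
}

# Made-up example: a 10 x 10 matrix with 15 cells already missing.
m <- matrix(1:100, nrow = 10, ncol = 10)
m[1:15] <- NA
set.seed(1)                        # arbitrary seed
out <- addNAsToObserved(m, pctNA = 0.1)
sum(is.na(out))                    # 15 existing + 10 new NAs
```

With this approach the count of new NAs is exact in a single pass, at the price of one additional scan of the data.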

Alex W answered Sep 19 '22 01:09


Some users reported that Alex's answer did not address condition 5 of my question. Indeed, when random NAs are added to a data frame that already contains missing values, some of the new ones fall on the existing ones, so the final proportion usually falls short of the initial proportion plus the requested one. I therefore expanded Alex's function to comply with all 5 conditions.
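The shortfall can be reproduced without any data frame. In this minimal sketch, N, m and k are made-up sizes (total cells, cells already NA, new NAs requested), and positions are drawn over all cells as in Alex's function.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
N <- 1000            # total number of cells (hypothetical)
m <- 200             # cells that are already NA (20 %)
k <- 100             # new NAs requested (10 %)

is.missing <- rep(c(TRUE, FALSE), times = c(m, N - m))
# Draw k positions over ALL cells; some of them hit cells that are already NA.
is.missing[sample.int(N, k)] <- TRUE

mean(is.missing)     # between 0.20 and 0.30; 0.30 would require zero overlaps
k * (N - m) / N      # expected number of genuinely new NAs: 80 rather than 100
```

On average only the fraction (N - m) / N of the requested positions lands on observed cells, which is why the final proportion stays below the sum of the two.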

I modified his createNAs function so that it supports one of 3 options:

  • option complement: complement with NA up to the desired %
  • option add: add the desired % of NA on top of those already present
  • option none: add a % of NA regardless of those already present
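To make the difference between the first two options concrete, here is a small sketch of the NA count each one aims for. The helper targetNAcount and its arguments are hypothetical and only illustrate the arithmetic. Option none has no fixed target, since the result depends on how many draws hit existing NAs.

```r
# Hypothetical helper: number of NA cells the output should contain.
#   n.cells   total number of cells in the data
#   n.initial number of cells that are already NA
#   pctNA     requested proportion of NA
targetNAcount <- function(n.cells, n.initial, pctNA,
                          option = c("add", "complement")) {
  option <- match.arg(option)
  n.requested <- floor(n.cells * pctNA)
  switch(option,
    # complement: reach pctNA overall, never removing existing NAs
    complement = max(n.initial, n.requested),
    # add: pctNA on top of what is already missing, capped at all cells
    add        = min(n.cells, n.initial + n.requested)
  )
}

# 100 cells, 15 of them already NA, 20 % requested:
targetNAcount(100, 15, 0.2, "complement")   # 20 NAs in total (5 new)
targetNAcount(100, 15, 0.2, "add")          # 35 NAs in total (20 new)
```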

For the first two options (complement and add), the function calls itself repeatedly until the desired proportion of NA is reached:

createNAs <- function (x, pctNA = 0.0, option = "add"){
  # Proportion of NA cells in x
  prop.NA = function(x) sum(is.na(x))/prod(dim(x))
  initial.pctNA = prop.NA(x)

  if (  (option == "complement") && (initial.pctNA > pctNA)  ){
    message("The data already had more NA than the target percentage. Returning original data")
    return(x)
  }

  if (  (option == "none") || (initial.pctNA == 0)  ){
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
  } else { # option "complement" or "add"
    target = ifelse(option == "complement", pctNA, pctNA + initial.pctNA)
    n.cells = prod(dim(x))
    # Work in whole cells so the loop always terminates: the target is capped
    # at n.cells, with a small tolerance against floating-point error.
    n.target = min(n.cells, floor(n.cells * target + 1e-8))
    while (sum(is.na(x)) < n.target) {
      n.remaining = n.target - sum(is.na(x))
      # The + 0.5 makes floor() in the "none" branch return exactly n.remaining
      x = createNAs(x, (n.remaining + 0.5) / n.cells, option = "none")
    }
    return(x)
  }
}

agenis answered Sep 20 '22 01:09