Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:

0-99 Data

-1 Question not asked

-5 Do not know

-7 Refused to respond

-9 Module not asked

Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.

I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.

like image 482
Ari B. Friedman Avatar asked Mar 17 '11 06:03

Ari B. Friedman


People also ask

How do you handle missing values in a data set in R?

nan() Function for Finding Missing values: A logical vector is returned by this function that indicates all the NaN values present. It returns a Boolean value. If NaN is present in a vector it returns TRUE else FALSE.

How does R handle missing values explain with an example?

To see which values in each of these vectors R recognizes as missing, we can use the is.na function. It will return a TRUE/FALSE vector with as any elements as the vector we provide. We can see that R distinguishes between the NA and “NA” in x2–NA is seen as a missing value, “NA” is not.

How do I not use missing values in R?

First, if we want to exclude missing values from mathematical operations use the na. rm = TRUE argument. If you do not exclude these values most functions will return an NA . We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data.


2 Answers

I know what you look for, and that is not implemented in R. I have no knowledge of a package where that is implemented, but it's not too difficult to code it yourself.

A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.

eg :

NACode <- function(x,code){
    Df <- sapply(x,function(i){
        i[i %in% code] <- NA
        i
    })

    id <- which(is.na(Df))
    rowid <- id %% nrow(x)
    colid <- id %/% nrow(x) + 1
    NAdf <- data.frame(
        id,rowid,colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df,"NAcode") <- NAdf
    Df
}

This allows to do :

> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA NA NA 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :

ChangeNAToCode <- function(x,code){
    NAval <- attr(x,"NAcode")
    for(i in which(NAval$value %in% code))
        x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]

    x
}

> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame':   10 obs. of  2 variables:
 $ A: num  1 2 3 4 5 6 7 8 9 10
 $ B: num  1 2 3 4 5 NA -2 -3 9 10
 - attr(*, "NAcode")='data.frame':      3 obs. of  4 variables:
  ..$ id   : int  16 17 18
  ..$ rowid: int  6 7 8
  ..$ colid: num  2 2 2
  ..$ value: num  -1 -2 -3

This allows to change only the codes you want, if that ever is necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code, I guess you can figure that one out yourself.

But in one line : using attributes and indices might be a nice way of doing it.

like image 161
Joris Meys Avatar answered Oct 24 '22 18:10

Joris Meys


The most obvious way seems to use two vectors:

  • Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
  • Vector 2: a vector of factors, indicating the type of data. For example, factor(c(1, 1, -1, -7)) where factor 1 indicates the a correctly answered question.

Having this structure would give you a create deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.

Update following questions from @gsk3

  1. Data storage will dramatically increase: The data storage will double. However, if doubling the size causes real problem it may be worth thinking about other strategies.
  2. Programs don't automatically deal with it. That's a strange comment. Some functions by default handle NAs in a sensible way. However, you want to treat the NAs differently so that implies that you will have to do something bespoke. If you want to just analyse data where the NA's are "Question not asked", then just use a data frame subset.
  3. now you have to manipulate two vectors together every time you want to conceptually manipulate a variable I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
  4. There's no standard implementation, so my solution might differ from someone else's. True. However, if an off the shelf package doesn't meet your needs, then (almost) by definition you want to do something different.

I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.

like image 24
csgillespie Avatar answered Oct 24 '22 19:10

csgillespie