
Adding two variables with missing data

Tags: r, missing-data

This is probably a really simple question for regular R users but I can't seem to find the solution. I want to add two variables with missing data.

x1 <- c(NA, 3, NA, 5)
x2 <- c(NA, NA, 4, 3)
x3 <- x1 + x2
x3
[1] NA NA NA  8

But what I really want is:

[1] NA 3 4 8

How can I ignore a single NA when the other value is present, but keep NA where both values are missing? Any suggestions would be much appreciated.

swhusky asked Feb 19 '15 00:02


2 Answers

To keep the NA if both are NA (ripping off @Ben Bolker's approach of using cbind):

apply(cbind(x1, x2), 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=TRUE)))
# [1] NA  3  4  8

Or, if you prefer using the rowSums function (which is attractive because it's vectorized whereas the apply and mapply solutions are not):

rowSums(cbind(x1, x2), na.rm=TRUE) + ifelse(is.na(x1) & is.na(x2), NA, 0)
# [1] NA  3  4  8
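To see why the rowSums one-liner works, it may help to look at the two pieces separately. This is just a breakdown on the small vectors from the question; the names totals and mask are mine, for illustration only:

```r
x1 <- c(NA, 3, NA, 5)
x2 <- c(NA, NA, 4, 3)

# Piece 1: row-wise sum with NAs dropped; a row that is all NA sums to 0.
totals <- rowSums(cbind(x1, x2), na.rm = TRUE)
totals
# [1] 0 3 4 8

# Piece 2: a mask that is NA where both inputs are NA and 0 elsewhere.
mask <- ifelse(is.na(x1) & is.na(x2), NA, 0)
mask
# [1] NA  0  0  0

# Adding the mask leaves the valid sums unchanged and turns the all-NA
# rows back into NA, because NA + anything is NA.
totals + mask
# [1] NA  3  4  8
```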

Neither of these would be quite as fast as an Rcpp function (which would only need to loop through the two inputs once):

library(Rcpp)
sum.na.ign <- cppFunction("
NumericVector sumNaIgn(NumericVector x, NumericVector y) {
  const int n = x.size();
  NumericVector out(n);
  for (int i=0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      out[i] = y[i];
    } else if (R_IsNA(y[i])) {
      out[i] = x[i];
    } else {
      out[i] = x[i] + y[i];
    }
  }
  return out;
}")
sum.na.ign(x1, x2)
# [1] NA  3  4  8
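If compiling C++ isn't an option, the same three branches can be sketched in base R with a nested ifelse. The name sum_na_ign_r is made up for this example, and note one difference: is.na() also treats NaN as missing, whereas R_IsNA in the C++ code does not.

```r
x1 <- c(NA, 3, NA, 5)
x2 <- c(NA, NA, 4, 3)

# Hypothetical helper mirroring the branch logic of the Rcpp loop.
sum_na_ign_r <- function(x, y) {
  ifelse(is.na(x), y,                 # x missing: take y (NA if y is missing too)
         ifelse(is.na(y), x, x + y))  # y missing: take x; otherwise add
}

sum_na_ign_r(x1, x2)
# [1] NA  3  4  8
```

This is vectorized like the rowSums version, but I haven't benchmarked it here.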

We can benchmark (along with the solution based on mapply from @J. Won.) for larger vectors:

# First two functions along with mapply-based solution from @J. Won.
f1 <- function(x1, x2) apply(cbind(x1, x2), 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=TRUE)))
f2 <- function(x1, x2) rowSums(cbind(x1, x2), na.rm=TRUE) + ifelse(is.na(x1) & is.na(x2), NA, 0)
NAsum <- function(...) {
  if(any(!is.na(c(...)))) return(sum(..., na.rm=TRUE))
  return(NA)
}
jwon <- function(x1, x2) mapply(NAsum, x1, x2)

set.seed(144)
x1 <- sample(c(NA, 1:10), 10000, replace=TRUE)
x2 <- sample(c(NA, 1:10), 10000, replace=TRUE)
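ref <- jwon(x1, x2)
# all.equal() compares only two objects at a time, so check each result against ref
all(sapply(list(f1(x1, x2), f2(x1, x2), sum.na.ign(x1, x2)),
           function(r) isTRUE(all.equal(ref, r))))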
# [1] TRUE
library(microbenchmark)
microbenchmark(jwon(x1, x2), f1(x1, x2), f2(x1, x2), sum.na.ign(x1, x2))
# Unit: microseconds
#                expr       min         lq       mean     median        uq       max neval
#        jwon(x1, x2) 24044.658 28387.4280 35580.3434 35134.9940 38175.661 91476.032   100
#          f1(x1, x2) 37516.769 46664.6390 52293.5265 51570.2690 56647.063 77576.091   100
#          f2(x1, x2)  2588.820  2738.0740  2930.4106  2833.4880  2974.745  5187.684   100
#  sum.na.ign(x1, x2)    97.988   109.8575   132.9849   123.0795   142.725   533.275   100

The rowSums solution is vectorized and therefore more than 10x faster than the apply and mapply solutions, which would feel slow with vectors of length 1 million. The custom Rcpp solution is in turn more than 10x faster than the rowSums approach (median of about 123 versus 2833 microseconds above). Your vectors would probably need to be pretty large for the Rcpp version to be worth the trouble compared to rowSums.
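The rowSums idea should also extend to more than two vectors. A minimal sketch, assuming all inputs have the same length; na_sum is a hypothetical helper name, not something from a package:

```r
# Hypothetical helper: element-wise sum of any number of equal-length
# vectors, ignoring NAs but returning NA where every input is NA.
na_sum <- function(...) {
  m <- cbind(...)                       # one column per input vector
  out <- rowSums(m, na.rm = TRUE)       # all-NA rows come out as 0 here
  out[rowSums(!is.na(m)) == 0] <- NA    # restore NA for rows with no data
  out
}

na_sum(c(NA, 3, NA, 5), c(NA, NA, 4, 3), c(NA, 1, NA, NA))
# [1] NA  4  4  8
```

Counting the non-missing entries per row replaces the pairwise is.na(x1) & is.na(x2) test, so the function doesn't need to know how many vectors it was given.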

josliber answered Sep 27 '22 17:09


mapply(sum, x1, x2, na.rm=TRUE)

EDIT: the one-liner above returns 0 where both values are NA. If we want NA there instead, as requested in the comments, I think it requires a custom function:

NAsum <- function(...) {
  if(any(!is.na(c(...)))) return(sum(..., na.rm=TRUE))
  return(NA)
}

mapply(NAsum, x1, x2)
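
For the small vectors from the question, the difference between the two calls looks like this. NAsum is repeated so the snippet runs on its own:

```r
x1 <- c(NA, 3, NA, 5)
x2 <- c(NA, NA, 4, 3)

NAsum <- function(...) {
  # Sum the non-missing values if there are any; otherwise return NA.
  if (any(!is.na(c(...)))) return(sum(..., na.rm = TRUE))
  NA
}

# Plain sum with na.rm: an all-NA pair sums to 0.
mapply(sum, x1, x2, na.rm = TRUE)
# [1] 0 3 4 8

# Custom function: an all-NA pair stays NA.
mapply(NAsum, x1, x2)
# [1] NA  3  4  8
```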
J. Win. answered Sep 27 '22 19:09