This is probably a really simple question for regular R users but I can't seem to find the solution. I want to add two variables with missing data.
x1<-c(NA,3,NA,5)
x2<-c(NA,NA,4,3)
x3<-x1+x2
x3
[1] NA NA NA 8
But what I really want is:
[1] NA 3 4 8
Any suggestions would be much appreciated. How can I keep the NA's?
To keep the NA if both are NA (ripping off @Ben Bolker's approach of using cbind):
apply(cbind(x1, x2), 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=TRUE)))
# [1] NA 3 4 8
Or, if you prefer using the rowSums function (which is attractive because it's vectorized, whereas the apply and mapply solutions are not):
rowSums(cbind(x1, x2), na.rm=TRUE) + ifelse(is.na(x1) & is.na(x2), NA, 0)
# [1] NA 3 4 8
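As a rough sketch, the rowSums idea should also extend to more than two vectors if you count the non-missing values in each row instead of testing each input by name. The helper name na_sum below is made up for illustration; it is not a base R function:

```r
# Hypothetical helper (not part of base R): row-wise sum that ignores NAs
# but returns NA when every value in a row is missing.
na_sum <- function(...) {
  m <- cbind(...)                     # one column per input vector
  s <- rowSums(m, na.rm = TRUE)       # all-NA rows come out as 0 here
  s[rowSums(!is.na(m)) == 0] <- NA    # restore NA where nothing was observed
  s
}

na_sum(c(NA, 3, NA, 5), c(NA, NA, 4, 3))
# [1] NA  3  4  8
```

This assumes all inputs have the same length; cbind would otherwise recycle the shorter ones.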
Neither of these would be quite as fast as an Rcpp function (which would only need to loop through the two inputs once):
library(Rcpp)
sum.na.ign <- cppFunction("
NumericVector sumNaIgn(NumericVector x, NumericVector y) {
  const int n = x.size();
  NumericVector out(n);
  for (int i=0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      out[i] = y[i];
    } else if (R_IsNA(y[i])) {
      out[i] = x[i];
    } else {
      out[i] = x[i] + y[i];
    }
  }
  return out;
}")
sum.na.ign(x1, x2)
# [1] NA 3 4 8
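If compiling C++ is not an option, the same three-way branch can be sketched in base R with nested ifelse calls. The name sum_na_ign_r is hypothetical, and note one assumed difference: is.na() is also TRUE for NaN, whereas R_IsNA only matches NA.

```r
# Hypothetical base-R counterpart of the loop above (name is illustrative).
# If x is missing take y, if y is missing take x, otherwise add them.
# When both are missing the result is y, which is NA.
sum_na_ign_r <- function(x, y) {
  ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))
}

sum_na_ign_r(c(NA, 3, NA, 5), c(NA, NA, 4, 3))
# [1] NA  3  4  8
```

This is vectorized, so it avoids the per-element function calls of apply and mapply, but I have not benchmarked it against the other solutions.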
We can benchmark (along with the solution based on mapply from @J. Won.) for larger vectors:
# First two functions along with mapply-based solution from @J. Won.
f1 <- function(x1, x2) apply(cbind(x1, x2), 1, function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm=TRUE)))
f2 <- function(x1, x2) rowSums(cbind(x1, x2), na.rm=TRUE) + ifelse(is.na(x1) & is.na(x2), NA, 0)
NAsum <- function(...) {
  if(any(!is.na(c(...)))) return(sum(..., na.rm=TRUE))
  return(NA)
}
jwon <- function(x1, x2) mapply(NAsum, x1, x2)
set.seed(144)
x1 <- sample(c(NA, 1:10), 10000, replace=TRUE)
x2 <- sample(c(NA, 1:10), 10000, replace=TRUE)
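# all.equal() compares only two objects at a time, so check each result against the first
ref <- jwon(x1, x2)
all(sapply(list(f1(x1, x2), f2(x1, x2), sum.na.ign(x1, x2)),
           function(res) isTRUE(all.equal(ref, res))))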
# [1] TRUE
library(microbenchmark)
microbenchmark(jwon(x1, x2), f1(x1, x2), f2(x1, x2), sum.na.ign(x1, x2))
# Unit: microseconds
#                expr       min         lq       mean     median        uq       max neval
#        jwon(x1, x2) 24044.658 28387.4280 35580.3434 35134.9940 38175.661 91476.032   100
#          f1(x1, x2) 37516.769 46664.6390 52293.5265 51570.2690 56647.063 77576.091   100
#          f2(x1, x2)  2588.820  2738.0740  2930.4106  2833.4880  2974.745  5187.684   100
#  sum.na.ign(x1, x2)    97.988   109.8575   132.9849   123.0795   142.725   533.275   100
The rowSums solution is vectorized and therefore faster than the apply and mapply solutions (these would feel slow with vectors of length 1 million), but the custom Rcpp solution is more than 10x faster than the rowSums approach. Your vectors would probably need to be pretty large for the Rcpp version to be worthwhile compared to rowSums.
mapply(sum, x1, x2, na.rm=TRUE)
EDIT: if we want a more complicated version, as requested in the comments, I think it requires a custom function:
NAsum <- function(...) {
  if(any(!is.na(c(...)))) return(sum(..., na.rm=TRUE))
  return(NA)
}
mapply(NAsum, x1, x2)
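To see why the custom function matters, here is a small side-by-side sketch using the example vectors from the question. With na.rm=TRUE, sum() over only missing values returns 0, so the plain mapply call turns the first element into 0 rather than NA:

```r
x1 <- c(NA, 3, NA, 5)
x2 <- c(NA, NA, 4, 3)

# Plain version: a pair that is entirely NA sums to 0
plain <- mapply(sum, x1, x2, na.rm = TRUE)
plain
# [1] 0 3 4 8

# Custom version: returns NA when there is nothing to sum
NAsum <- function(...) {
  if (any(!is.na(c(...)))) return(sum(..., na.rm = TRUE))
  return(NA)
}
kept <- mapply(NAsum, x1, x2)
kept
# [1] NA  3  4  8
```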