Efficient way to calculate sum or return NA if all values are NA

Tags: r, na, sum

During a simulation I created multiple data sets with more than 1,000,000 variables. Some of the values of these variables are NA, and in some cases all values of a variable are NA. I'd like to calculate the sum of the values of each variable, but get NA if all of its values are NA.

The problem with the common sum(x, na.rm = TRUE) or sum(na.omit(x)) is that both return 0 if all values are NA. Thus, I've written my own function that deals with NA in the expected way:

sumna <- function(x) {
  sumna <- NULL
  return(ifelse(all(is.na(x)), NA, sum(na.omit(x))))
}

However, that implementation is rather slow.

Thus, I'm looking for an implementation or pre-implemented function that sums up values of a vector, omits NA and returns NA if all values are NA.

Many thanks in advance!

asked Nov 30 '22 21:11 by Anti


1 Answer


The sum_ function from the hablar package behaves the way the OP wants, so there is no need to reinvent the wheel:

library(hablar)
sum_(c(1:10, NA))
#[1] 55
sum_(c(NA, NA, NA))
#[1] NA

It can also be used in tidyverse or data.table pipelines (df1 below stands for the data frame that holds the variables):

library(dplyr)
df1 %>%
    summarise_all(sum_)

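If we would rather fix the OP's custom function, a plain if/else is a better option than ifelse: the condition all(is.na(x)) is a single logical value, whereas ifelse is designed for element-wise tests on vectors.

sumna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}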
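
As a quick check of how the if/else version behaves, here is a minimal base R sketch. The sumna_real variant is an assumption added for illustration and is not part of the original answer; it returns a typed NA so that the result is numeric even for all-NA input.

```r
sumna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}

r_mixed <- sumna(c(1:10, NA))    # 55
r_allna <- sumna(c(NA, NA, NA))  # NA
# Edge case: all() of an empty logical vector is TRUE, so an empty input gives NA
r_empty <- sumna(numeric(0))     # NA

# Hypothetical variant (assumption): return NA_real_ instead of a logical NA
sumna_real <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}
is_num <- is.numeric(sumna_real(c(NA, NA)))  # TRUE
```

Note that an empty vector also yields NA here, unlike sum(numeric(0)), which is 0; whether that is desirable depends on the data.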

Alternatively, we can use the vectorized colSums to sum every column at once and then reset the columns that consist only of NA:

v1 <- colSums(df1, na.rm = TRUE)
v1[colSums(is.na(df1)) == nrow(df1)] <- NA
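
To see what the two colSums lines do, here is a small worked example. The data frame df1 is made up for illustration (three numeric columns, one of them entirely NA).

```r
# Hypothetical toy data: column b is all NA
df1 <- data.frame(
  a = c(1, 2, NA),
  b = c(NA_real_, NA_real_, NA_real_),
  c = c(4, 5, 6)
)

# Column sums ignoring NA; the all-NA column b comes out as 0 at this point
v1 <- colSums(df1, na.rm = TRUE)

# Count the NAs per column; where the count equals the number of rows,
# the whole column is NA, so its sum is set back to NA
v1[colSums(is.na(df1)) == nrow(df1)] <- NA

v1  # a = 3, b = NA, c = 15
```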

Because the dataset is huge, we can also use data.table, which is efficient on large data:

library(data.table)
setDT(df1)[, lapply(.SD, sumna)]

Or, using the tidyverse:

library(tidyverse)
df1 %>%
     summarise_all(sumna)
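
To compare the approaches on your own machine, a rough base R timing sketch follows. The simulated data, the helper names (sumna_ifelse, sumna_if, colsums_na) and the use of vapply as a stand-in for the column-wise data.table and dplyr calls are all assumptions for illustration; timings will vary with data size and hardware, so no results are claimed here.

```r
set.seed(1)

# Hypothetical test data: 1,000 rows x 500 columns, about 10% NA,
# with the first column made entirely NA
m <- matrix(rnorm(5e5), nrow = 1000)
m[sample(length(m), 5e4)] <- NA
m[, 1] <- NA
df1 <- as.data.frame(m)

# The OP's ifelse version and the if/else version
sumna_ifelse <- function(x) ifelse(all(is.na(x)), NA, sum(na.omit(x)))
sumna_if <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)

# The colSums approach wrapped in a function
colsums_na <- function(d) {
  v <- colSums(d, na.rm = TRUE)
  v[colSums(is.na(d)) == nrow(d)] <- NA
  v
}

t1 <- system.time(r1 <- vapply(df1, sumna_ifelse, numeric(1)))
t2 <- system.time(r2 <- vapply(df1, sumna_if, numeric(1)))
t3 <- system.time(r3 <- colsums_na(df1))

# All three approaches should give the same sums (up to floating point tolerance)
stopifnot(isTRUE(all.equal(r1, r2)), isTRUE(all.equal(r2, r3)))

# Elapsed seconds for each approach
c(ifelse = t1[["elapsed"]], if_else = t2[["elapsed"]], colSums = t3[["elapsed"]])
```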

answered Dec 10 '22 10:12 by akrun