Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement coalesce efficiently in R

Tags:

r

coalesce

Background

Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL elements in them.

I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.

I have made a naive implementation myself but it is ridiculously slow.

coalesce <- function(...) {   apply(cbind(...), 1, function(x) {           x[which(!is.na(x))[1]]         }) } 

Example

a <- c(1,  2,  NA, 4, NA) b <- c(NA, NA, NA, 5, 6) c <- c(7,  8,  NA, 9, 10) coalesce(a,b,c) # [1]  1  2 NA  4  6 

Question

Is there any efficient way to implement coalesce in R?

like image 776
while Avatar asked Oct 08 '13 16:10

while


2 Answers

On my machine, using Reduce gets a 5x performance improvement:

coalesce2 <- function(...) {   Reduce(function(x, y) {     i <- which(is.na(x))     x[i] <- y[i]     x},   list(...)) }  > microbenchmark(coalesce(a,b,c),coalesce2(a,b,c)) Unit: microseconds                expr    min       lq   median       uq     max neval   coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100  coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100 
like image 160
mrip Avatar answered Sep 18 '22 07:09

mrip


Looks like coalesce1 is still available

coalesce1 <- function(...) {     ans <- ..1     for (elt in list(...)[-1]) {         i <- is.na(ans)         ans[i] <- elt[i]     }     ans } 

which is faster still (but more-or-less a hand re-write of Reduce, so less general)

> identical(coalesce(a, b, c), coalesce1(a, b, c)) [1] TRUE > microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c)) Unit: microseconds                expr     min       lq   median       uq     max neval   coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100  coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100  coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100 

Or for larger data compare

coalesce1a <- function(...) {     ans <- ..1     for (elt in list(...)[-1]) {         i <- which(is.na(ans))         ans[i] <- elt[i]     }     ans } 

showing that which() can sometimes be effective, even though it implies a second pass through the index.

> aa <- sample(a, 100000, TRUE) > bb <- sample(b, 100000, TRUE) > cc <- sample(c, 100000, TRUE) > microbenchmark(coalesce1(aa, bb, cc), +                coalesce1a(aa, bb, cc), +                coalesce2(aa,bb,cc), times=10) Unit: milliseconds                    expr       min        lq    median        uq       max neval   coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10  coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10   coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10 
like image 42
Martin Morgan Avatar answered Sep 20 '22 07:09

Martin Morgan