Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL
elements in them.
I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.
I have made a naive implementation myself but it is ridiculously slow.
coalesce <- function(...) { apply(cbind(...), 1, function(x) { x[which(!is.na(x))[1]] }) }
a <- c(1, 2, NA, 4, NA) b <- c(NA, NA, NA, 5, 6) c <- c(7, 8, NA, 9, 10) coalesce(a,b,c) # [1] 1 2 NA 4 6
Is there any efficient way to implement coalesce
in R?
On my machine, using Reduce
gets a 5x performance improvement:
coalesce2 <- function(...) { Reduce(function(x, y) { i <- which(is.na(x)) x[i] <- y[i] x}, list(...)) } > microbenchmark(coalesce(a,b,c),coalesce2(a,b,c)) Unit: microseconds expr min lq median uq max neval coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438 100 coalesce2(a, b, c) 19.601 21.4055 22.8835 23.8315 45.419 100
Looks like coalesce1 is still available
coalesce1 <- function(...) { ans <- ..1 for (elt in list(...)[-1]) { i <- is.na(ans) ans[i] <- elt[i] } ans }
which is faster still (but more-or-less a hand re-write of Reduce
, so less general)
> identical(coalesce(a, b, c), coalesce1(a, b, c)) [1] TRUE > microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c)) Unit: microseconds expr min lq median uq max neval coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348 100 coalesce1(a, b, c) 8.287 9.4110 10.9515 12.1295 20.940 100 coalesce2(a, b, c) 37.711 40.1615 42.0885 45.1705 67.258 100
Or for larger data compare
coalesce1a <- function(...) { ans <- ..1 for (elt in list(...)[-1]) { i <- which(is.na(ans)) ans[i] <- elt[i] } ans }
showing that which()
can sometimes be effective, even though it implies a second pass through the index.
> aa <- sample(a, 100000, TRUE) > bb <- sample(b, 100000, TRUE) > cc <- sample(c, 100000, TRUE) > microbenchmark(coalesce1(aa, bb, cc), + coalesce1a(aa, bb, cc), + coalesce2(aa,bb,cc), times=10) Unit: milliseconds expr min lq median uq max neval coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533 10 coalesce1a(aa, bb, cc) 2.906067 2.953266 2.962729 2.971761 3.452251 10 coalesce2(aa, bb, cc) 3.080842 3.115607 3.139484 3.166642 3.198977 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With