Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How can a function accept variable arguments using ellipsis (...) without copying them in memory?

[EDIT: The issue prompting this workaround has been fixed since R 3.1.0.]

I was asked elsewhere to post this as a self-answered question.

When an R function accepts an arbitrary number of parameters through the ellipsis arguments, the common way to access them is using list(...):

f <- function(...) {
  dots <- list(...)

  # Let's print them out.
  for (i in seq_along(dots)) {
    cat(i, ": name=", names(dots)[i], "\n", sep="")
    print(dots[[i]])
  }
}

> f(10, a=20)
1: name=
[1] 10
2: name=a
[1] 20

However, R (as of v3.0.2) deep-copies all list elements:

> x <- 10
> .Internal(inspect(x))
@10d85ca68 14 REALSXP g0c1 [MARK,NAM(2),TR] (len=1, tl=0) 10

> x2 <- x
> .Internal(inspect(x2))  # Not copied.
@10d85ca68 14 REALSXP g0c1 [MARK,NAM(2),TR] (len=1, tl=0) 10

> y <- list(x)
> .Internal(inspect(y[[1]]))  # x was copied to a different address:
@10dd45e88 14 REALSXP g0c1 [MARK,NAM(1),TR] (len=1, tl=0) 10

> z <- list(y)
> .Internal(inspect(z))  # y was deep-copied:
@10d889ed8 19 VECSXP g0c1 [MARK,NAM(1)] (len=1, tl=0)
  @10d889f38 19 VECSXP g0c1 [MARK,TR] (len=1, tl=0)
    @10d889f68 14 REALSXP g0c1 [MARK] (len=1, tl=0) 10

You can verify this with tracemem as well, if you have memory profiling enabled.

So you've been storing large objects in a list? Copied. Passing them into any function that calls list(...) inside? Copied:

> g <- function(...) for (x in list(...)) .Internal(inspect(x))
> g(z)  # Copied.
@10dd45e58 19 VECSXP g0c1 [] (len=1, tl=0)
  @10dd35fa8 19 VECSXP g0c1 [] (len=1, tl=0)
    @10dd36068 19 VECSXP g0c1 [] (len=1, tl=0)
      @10dd36158 14 REALSXP g0c1 [] (len=1, tl=0) 10
> g(z)  # ...copied again.
@10dd32268 19 VECSXP g0c1 [] (len=1, tl=0)
  @10d854c68 19 VECSXP g0c1 [] (len=1, tl=0)
    @10d8548d8 19 VECSXP g0c1 [] (len=1, tl=0)
      @10d8548a8 14 REALSXP g0c1 [] (len=1, tl=0) 10

Not horrified yet? Try grep -l "list(\.\.\.)" *.R in R library sources. My favorite is mapply/Map, which I was routinely calling on GBs of data and wondering why memory was running out. At least lapply is fine.

So, how can I write a variadic function with ... arguments and avoid copying them?

like image 647
codeola Avatar asked Mar 04 '14 22:03

codeola


1 Answers

We can expand ... arguments using match.call and then evaluate and store the arguments in an environment which will not copy the values. Since environment objects require names for all elements and don't preserve their ordering, we need to store a separate vector of ordered tag names in addition to the (optional) formal argument names. Implemented here using attributes:

argsenv <- function(..., parent=parent.frame()) {
  cl <- match.call(expand.dots=TRUE)

  e <- new.env(parent=parent)
  pf <- parent.frame()
  JJ <- seq_len(length(cl) - 1)
  tagnames <- sprintf(".v%d", JJ)
  for (i in JJ) e[[tagnames[i]]] <- eval(cl[[i+1]], envir=pf)

  attr(e, "tagnames") <- tagnames
  attr(e, "formalnames") <- names(cl)[-1]
  class(e) <- c("environment", "argsenv")
  e
}

Now we can use it in our functions instead of list(...):

f <- function(...) {
  dots <- argsenv(...)

  # Let's print them out.
  for (i in seq_along(attr(dots, "tagnames"))) {
    cat(i, ": name=", attr(dots, "formalnames")[i], "\n", sep="")
    print(dots[[attr(dots, "tagnames")[i]]])
  }
}

> f(10, a=20)
1: name=
[1] 10
2: name=a
[1] 20

So it works, but does it avoid copying?

g1 <- function(...) {
  dots <- list(...)
  for (x in dots) .Internal(inspect(x))
}

> z <- 10
> .Internal(inspect(z))
@10d854908 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 10
> g1(z)
@10dcdaba8 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 10
> g1(z, z)
@10dcbb558 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 10
@10dcd53d8 14 REALSXP g0c1 [NAM(2)] (len=1, tl=0) 10
> 

g2 <- function(...) {
   dots <- argsenv(...);
   for (x in attr(dots, "tagnames")) .Internal(inspect(dots[[x]]))
}

> .Internal(inspect(z))
@10d854908 14 REALSXP g0c1 [MARK,NAM(2)] (len=1, tl=0) 10
> g2(z)
@10d854908 14 REALSXP g0c1 [MARK,NAM(2)] (len=1, tl=0) 10
> g2(z, z)
@10d854908 14 REALSXP g0c1 [MARK,NAM(2)] (len=1, tl=0) 10
@10d854908 14 REALSXP g0c1 [MARK,NAM(2)] (len=1, tl=0) 10

You could implement this in S4 with slots instead of attributes, define all sorts of methods (length, [, [[, c, etc.) for it, and turn it into a full-fledged general-purpose non-copying replacement for list. But that's another post.

Side note: You can avoid mapply/Map by rewriting all such calls as lapply(seq_along(v1) function(i) FUN(v1[[i]], v2[[i]],...), but that's a lot of work and doesn't do your code any favors in elegance and readability. Instead, we can rewrite the mapply/Map functions using argsenv and some expression manipulation to do exactly that inside:

mapply2 <- function(FUN, ..., MoreArgs=NULL, SIMPLIFY=TRUE, USE.NAMES=TRUE) {
  FUN <- match.fun(FUN)

  args <- argsenv(...)
  tags <- attr(args, "tagnames")
  iexpr <- quote(.v1[[i]])
  iargs <- lapply(tags, function(x) { iexpr[[2]] <- as.name(x); iexpr })
  names(iargs) <- attr(args, "formalnames")
  iargs <- c(iargs, as.name("..."))
  icall <- quote(function(i, ...) FUN())[-4]
  icall[[3]] <- as.call(c(quote(FUN), iargs))
  ifun <- eval(icall, envir=args)

  lens <- sapply(tags, function(x) length(args[[x]]))
  maxlen <- if (length(lens) == 0) 0 else max(lens)
  if (any(lens != maxlen)) stop("Unequal lengths; recycle not implemented")

  answer <- do.call(lapply, c(list(seq_len(maxlen), ifun), MoreArgs))

  # The rest is from the original mapply code.

  if (USE.NAMES && length(tags)) {
    arg1 <- args[[tags[1L]]]
    if (is.null(names1 <- names(arg1)) && is.character(arg1)) names(answer) <- arg1
    else if (!is.null(names1)) names(answer) <- names1
  }

  if (!identical(SIMPLIFY, FALSE) && length(answer)) 
      simplify2array(answer, higher = (SIMPLIFY == "array"))
  else answer
}

# Original Map code, but calling mapply2 instead.
Map2 <- function (f, ...) {
  f <- match.fun(f)
  mapply2(FUN=f, ..., SIMPLIFY=FALSE)
}

You could even name them mapply/Map in your package/global namespace to shadow the base versions and not have to modify the rest of your code. The implementation here is only missing the unequal length recycling feature, which you could add if you wanted to.

like image 119
codeola Avatar answered Oct 06 '22 16:10

codeola