
Why does lapply() not retain my data.table keys?

I have a bunch of data.tables in a list. I want to apply unique() to each data.table in my list, but doing so destroys all my data.table keys.

Here's an example:

library(data.table)

A <- data.table(a = rep(c("a", "b"), each = 3), b = runif(6), key = "a")
B <- data.table(x = runif(6), b = runif(6), key = "x")

blah <- unique(A)

Here, blah still has a key, and everything is right in the world:

key(blah)

# [1] "a"

But if I add the data.tables to a list and use lapply(), the keys get destroyed:

dt.list <- list(A, B)

unique.list <- lapply(dt.list, unique) # Keys destroyed here

lapply(unique.list, key) 

# [[1]]
# NULL

# [[2]]
# NULL

This probably has to do with me not really understanding what it means for keys to be assigned "by reference," as I've had other problems with keys disappearing.

So:

  • Why does lapply not retain my keys?
  • What does it mean to say keys are assigned "by reference"?
  • Should I even be storing data.tables in a list?
  • How can I safely store/manipulate data.tables without fear of losing my keys?

EDIT:

For what it's worth, the dreaded for loop keeps the keys intact:

unique.list <- list()

for (i in seq_along(dt.list)) {
  unique.list[[i]] <- unique(dt.list[[i]])
}

lapply(unique.list, key)

# [[1]]
# [1] "a"

# [[2]]
# [1] "x"

But this is R, and for loops are evil.

Paul Murray asked Feb 18 '13 01:02


2 Answers

Interestingly, these two calls give different results:

lapply(dt.list, unique) 
lapply(dt.list, function(x) unique(x)) 

With the latter, the keys are retained, as you would expect.


The seemingly unexpected behavior arises because the first lapply call invokes unique.data.frame (i.e., the method from {base}), while the second invokes unique.data.table.
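To try the wrapper approach end to end, here is a minimal sketch that reuses A, B and dt.list from the question. It assumes the data.table package is installed. Whether the keys survive may depend on your R and data.table versions, so the last line inspects them rather than assuming the result.

```r
library(data.table)

A <- data.table(a = rep(c("a", "b"), each = 3), b = runif(6), key = "a")
B <- data.table(x = runif(6), b = runif(6), key = "x")
dt.list <- list(A, B)

# Wrap unique() in an anonymous function so that method dispatch happens on
# an ordinary call, unique(x), instead of on the call lapply builds itself.
unique.list <- lapply(dt.list, function(x) unique(x))

# Check the keys explicitly instead of trusting that they were kept.
lapply(unique.list, key)
```

If a key does turn out to be missing, that is a sign the data.frame method was used for that element.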

Ricardo Saporta answered Nov 03 '22 02:11


Good question. It turns out that it's documented in ?lapply (see the Note section):

For historical reasons, the calls created by lapply are unevaluated, and code has been written (e.g. bquote) that relies on this. This means that the recorded call is always of the form FUN(X[[0L]], ...), with 0L replaced by the current integer index. This is not normally a problem, but it can be if FUN uses sys.call or match.call or if it is a primitive function that makes use of the call. This means that it is often safer to call primitive functions with a wrapper, so that e.g. lapply(ll, function(x) is.numeric(x)) is required in R 2.7.1 to ensure that method dispatch for is.numeric occurs correctly.
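You can look at the recorded call yourself with sys.call(). This is only a small illustration in base R. The helper name f and the input list are made up for the example, and the exact form printed for the lapply case may differ between R versions.

```r
# Capture the call that FUN sees on each iteration when driven by lapply.
recorded <- lapply(list(1, "a"), function(x) sys.call())

# Each element is an unevaluated call object. Printing it shows the form
# that lapply constructed, which is not a plain FUN(x).
print(recorded[[1]])

# By contrast, a direct call records exactly what was typed.
f <- function(x) sys.call()
print(f(1))
# f(1)
```

Anything that inspects the call, as the note describes, sees the lapply-constructed form in the first case and the literal call in the second, which is why the function(x) wrapper changes the outcome.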

Matt Dowle answered Nov 03 '22 02:11