understanding optimisation messages on assignment by reference in a data.table

Tags: r, data.table

This comes from an observation I made while answering a question from @sds.

First, let me switch on the trace messages for data.table:

library(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")

Now, suppose one wants the sum of all columns grouped by column a. We could do:

dt.out <- dt[, lapply(.SD, sum), by = a]

Next, suppose I also want to add the number of entries in each group to dt.out. I normally assign it by reference as follows:

dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]

In this assignment by reference, one of the messages data.table produces is:

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

This message comes from assign.c in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines. If necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5, so it's not a named vector. And I don't understand what the "recycled list RHS" refers to.

But if I do:

dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]

Then, I get the message:

Direct plonk of unnamed RHS, no copy.

As I understand this, the RHS has been duplicated in the first case, meaning a copy is being made (shallow or deep, I don't know). If so, why is this happening?

Even if not, why do the two forms of assignment by reference behave differently internally? Any ideas?

To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?

Arun asked Apr 22 '13 16:04


1 Answer

Update: The expression,

DT[, c(..., lapply(.SD, .), ...), by=.]

has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).

For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]

But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet. This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.


Where the message says NAMED vector, it means NAMED in R's internal, C-level sense: whether an object has been assigned to a symbol and is called something, not whether an atomic vector has a "names" attribute. The NAMED value in the SEXP structure takes the value 0, 1 or 2. R uses it to decide whether it needs to copy on subassignment. See section 1.1.2 of R-ints.
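
The copy-on-subassign behaviour that NAMED controls can be observed from plain R. The following is a minimal sketch in base R; the variable names x and y are illustrative only, and it shows the visible effect rather than inspecting the NAMED field itself:

```r
# x is a vector bound to a single symbol
x <- c(1, 2, 3)

# y now refers to the same vector, so R records that it is shared
y <- x

# sub-assigning into y must not change x, so R copies the vector first
y[1] <- 100

x  # still the original values
y  # the modified copy
```

An unnamed temporary, such as the result of an expression that was never bound to a symbol, has no other reference to protect, which is why it can be used directly without a copy.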

What would be better is if optimization of j in data.table could handle :

DT[, c(lapply(.SD,sum),.N), by=a]

That works but may be slow. Currently only the simpler form is optimized :

DT[, lapply(.SD,sum), by=a]
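
As a sketch of the combined form on the question's data (assuming the data.table package is installed; naming the count column via list(count = .N) is a choice made here for illustration), the sums and the group sizes can be computed in a single grouped call, which avoids the separate assignment by reference altogether:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Sum every non-grouping column and append the group size in the same call.
# c() on two lists returns one list, which data.table turns into columns.
dt.out <- dt[, c(lapply(.SD, sum), list(count = .N)), by = a]
dt.out
```

Whether this form is optimised internally depends on the data.table version in use; with datatable.verbose switched on, the trace messages indicate which path was taken.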

To answer the main question: yes, the following :

Direct plonk of unnamed RHS, no copy.

is desirable compared to :

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

Another way to achieve this is :

dt.out[, count := dt[, .N, by=a]$N]

I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
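
Whatever the internal difference, the three extraction forms return the same values, so the choice between them only affects whether a copy is made. A quick check, sketched on a reduced version of the question's data and assuming data.table is installed (the names counts, n1, n2 and n3 are illustrative):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, key = "a")

# Group sizes, computed once
counts <- dt[, .N, by = a]

n1 <- counts[, N]     # j-expression extraction
n2 <- counts[["N"]]   # list-style extraction
n3 <- counts$N        # dollar extraction

# All three hold the same integer vector of group sizes
same <- identical(n1, n2) && identical(n2, n3)
same
```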

Matt Dowle answered Oct 19 '22 18:10