
In R's data.table, how is the key of a data.table preserved into subsets referenced using .SD?

Tags: r, data.table

I am using the data.table package for some analyses. One of my steps uses the by = argument to compute aggregate statistics. However, the aggregates must be calculated on the unique records in each by subset. I have been using unique and keys to ensure that each by group consists of distinct records, along these lines:

dt_new <- dt_old[,uFunc_MyFunction(x = unique(.SD)),by = grouping_var]

I noticed that the key on .SD seemed to vary depending on the key set for dt_old and on the by = argument. This affected whether my resulting subsets were actually unique.

I wanted to get some clarity, so I wrote the below.

library(data.table)
set.seed(1554)
dt_example <- data.table(id = 1:50,
                         site = sample(x = c("A","B","C"),
                                       size = 50,
                                       replace = TRUE,
                                       prob = c(0.4,0.4,0.2)),
                         group = sample(x = c("Eta","Mu","Omicron","Psi"),
                                        size = 50,
                                        replace = TRUE),
                         team = sample(x = 1:3,
                                       size = 50,
                                       replace = TRUE,
                                       prob = c(0.2,0.3,0.5)))

setkey(x = dt_example,
       group,
       team)

> dt_example[,as.list(key(.SD)),by = site]
   site    V1   V2
1:    B group team
2:    A group team
3:    C group team

setkey(x = dt_example,
       site,
       group,
       team)

> dt_example[,as.list(key(.SD)),by = site]
Empty data.table (0 rows) of 1 col: site

What I am trying to understand is why, in the first version, the key for .SD is consistent, while, in the second version, .SD had no key at all. I think it has something to do with the fact that the by = column isn't directly included in .SD, which is breaking the key, but I wanted to confirm my logic.
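One way to check the premise that the by = column is not part of .SD is to list the column names .SD carries in each group. This is only a minimal sketch on a made-up table; dt_check and sd_cols are illustrative names that do not appear in the example above.

```r
library(data.table)

# Hypothetical toy table; the column names mirror the example above.
dt_check <- data.table(site  = c("A", "B"),
                       group = c("Eta", "Mu"),
                       team  = 1:2)

# For each site, record which columns are present in .SD.
# By default .SD holds every column except the ones named in by =.
sd_cols <- dt_check[, .(col = names(.SD)), by = site]
```

If the premise holds, sd_cols should list only group and team for every site, never site itself.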

So, my question is this: why does the subset of a data table, .SD, have no key when one of the columns which comprises the key of the parent data table is used as a by grouping variable?

TARehman asked Oct 19 '22 07:10


1 Answer

In this case, since the data is sorted by site, group, team and you are grouping by site, the key of .SD could be retained as group, team, because that order is preserved within each site. The simplest answer is that we seem to have missed this case. Could you please file an issue with just a link to this post?

As a workaround, you can use the by argument of the unique method for data.tables to specify the columns explicitly.
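A minimal sketch of that workaround, assuming the records should be de-duplicated on group and team within each site. The toy data and the names dt, dedup_cols and dt_unique are made up for illustration; they are not from the question.

```r
library(data.table)

# Hypothetical toy data; the column names mirror the question's example.
dt <- data.table(site  = c("A", "A", "A", "B", "B"),
                 group = c("Eta", "Eta", "Mu", "Eta", "Eta"),
                 team  = c(1L, 1L, 2L, 3L, 3L),
                 id    = 1:5)
setkey(dt, site, group, team)

# Name the de-duplication columns explicitly rather than relying on key(.SD),
# which is missing in this case.
dedup_cols <- c("group", "team")
dt_unique <- dt[, unique(.SD, by = dedup_cols), by = site]
```

Because the columns are passed through by, the result no longer depends on whether .SD inherited a key. On this toy data, each site should keep one row per distinct group and team pair.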

And as David pointed out, calling unique(.SD) on every group seems unnecessary, but that's probably a topic for another question.

Arun answered Oct 21 '22 06:10