I am using the data.table package to complete some analyses. One of the steps involves using the by = argument to obtain aggregate statistics. However, the aggregates must be calculated on the unique records in each by subset. I have been using unique and keys to ensure that each by group consists of distinct records. Something vaguely like the below:
dt_new <- dt_old[,uFunc_MyFunction(x = unique(.SD)),by = grouping_var]
I noticed that the key on .SD seemed to vary depending on the key set for dt_old and on the by = argument. Obviously, this affected whether my resulting subsets were unique or not.
I wanted to get some clarity, so I wrote the below.
library(data.table)
set.seed(1554)
dt_example <- data.table(id = 1:50,
                         site = sample(x = c("A","B","C"),
                                       size = 50,
                                       replace = TRUE,
                                       prob = c(0.4,0.4,0.2)),
                         group = sample(x = c("Eta","Mu","Omicron","Psi"),
                                        size = 50,
                                        replace = TRUE),
                         team = sample(x = 1:3,
                                       size = 50,
                                       replace = TRUE,
                                       prob = c(0.2,0.3,0.5)))
setkey(x = dt_example,
       group,
       team)
> dt_example[,as.list(key(.SD)),by = site]
   site    V1   V2
1:    B group team
2:    A group team
3:    C group team
setkey(x = dt_example,
       site,
       group,
       team)
> dt_example[,as.list(key(.SD)),by = site]
Empty data.table (0 rows) of 1 col: site
What I am trying to understand is why, in the first version, the key for .SD is consistent, while, in the second version, .SD has no key at all. I think it has something to do with the fact that the by = column isn't directly included in .SD, which breaks the key, but I wanted to confirm my logic.
So, my question is this: why does .SD, the subset of a data.table, have no key when one of the columns that make up the key of the parent data.table is used as a by grouping variable?
In this case, since the data is sorted by site, group, team and grouped by site, the key could be retained for group, team, as the order would be maintained. The simplest answer is that we seem to have missed this case. Could you please file an issue with just a link to this post?
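To illustrate the ordering argument, here is a minimal sketch on a small made-up table (dt and chk are hypothetical names, not from the question). It checks only that, after keying by site, group, team, the rows within each site group already arrive sorted by group and team. It does not inspect key(.SD) itself.

```r
library(data.table)

# Hypothetical toy data, deliberately unsorted
dt <- data.table(
  site  = c("B", "A", "B", "A", "A"),
  group = c("Mu", "Eta", "Eta", "Mu", "Eta"),
  team  = c(2L, 3L, 1L, 1L, 1L)
)
setkey(dt, site, group, team)

# For each site, test whether the rows are already ordered by (group, team).
# method = "radix" sorts strings in the C locale, as setkey() does.
chk <- dt[,
          .(sorted = identical(order(group, team, method = "radix"),
                               seq_len(.N))),
          by = site]
print(chk)
```

If sorted is TRUE for every site, the remaining key columns are in order inside each group, which is the property the answer refers to.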
As a workaround, you can use the by argument of the unique method for data.tables to specify the columns.
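A minimal sketch of that workaround, on made-up data (dt, res, and the value column are hypothetical). Naming the columns in by = means the de-duplication no longer depends on whatever key .SD carries. That matters most in data.table versions where unique() defaulted to the key columns.

```r
library(data.table)

# Hypothetical toy data with duplicate (group, team) pairs inside each site
dt <- data.table(
  site  = c("A", "A", "A", "B", "B"),
  group = c("Eta", "Eta", "Mu", "Eta", "Eta"),
  team  = c(1L, 1L, 2L, 3L, 3L),
  value = c(10, 11, 20, 30, 31)
)
setkey(dt, site, group, team)

# De-duplicate each by-group on explicitly named columns,
# rather than relying on key(.SD)
res <- dt[,
          .(n_distinct = nrow(unique(.SD, by = c("group", "team")))),
          by = site]
print(res)
```

Here nrow() stands in for the question's uFunc_MyFunction; any function of the de-duplicated subset could be used in its place.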
And as David pointed out, using unique(.SD) on every group seems unnecessary, but that's probably for another Q.
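One way to read that remark (an assumption, since the comment itself is not quoted here) is that the table can be de-duplicated once, with the grouping column included, before aggregating. A sketch on hypothetical data:

```r
library(data.table)

# Hypothetical toy data with duplicate (site, group, team) rows
dt <- data.table(
  site  = c("A", "A", "A", "B", "B"),
  group = c("Eta", "Eta", "Mu", "Eta", "Eta"),
  team  = c(1L, 1L, 2L, 3L, 3L)
)

# De-duplicate once up front, including the grouping column ...
dt_unique <- unique(dt, by = c("site", "group", "team"))

# ... then aggregate without calling unique() inside every group
res_once <- dt_unique[, .(n_distinct = .N), by = site]
print(res_once)
```

Whether this is equivalent to the per-group version depends on which columns define a distinct record in the real data.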