Going through the data.table vignette “Introduction to data.table”, section 2 “Aggregations” has this example:
ans <- flights[, .(.N), by = .(origin)]
ans
# origin N
# <char> <int>
# 1: JFK 81483
# 2: LGA 84433
# 3: EWR 87400
Replacing .N with the length of another column, e.g. year, gives the same number of rows per group:
> flights[, .(length(year)), by = .(origin)]
origin V1
1: JFK 81483
2: LGA 84433
3: EWR 87400
or
> flights[, .(length(carrier)), by = .(origin)]
origin V1
1: JFK 81483
2: LGA 84433
3: EWR 87400
That was expected. But when I use length(origin), i.e. the same variable that is used as the grouping variable in by, a different calculation is performed and the result is 1:
> flights[, .(length(origin)), by = .(origin)]
origin V1
1: JFK 1
2: LGA 1
3: EWR 1
Is there any explanation for why this happens?
In a more complicated example this could pass unnoticed, so it seems safer to always use the built-in .N than to try to compute counts with the length function.
This happens because the variables used in the by (or keyby) argument, i.e. the ones you are grouping your results by, have already been reduced to a scalar (the value for the current group) by the time you access them in the j argument.
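To see this without the flights data, here is a minimal sketch on a small made-up table. The table dt, its values, and the result column names (len_origin, len_carrier, n) are only illustrative assumptions, not taken from the vignette:

```r
library(data.table)

# Hypothetical toy data: 2 rows for JFK, 1 for LGA, 3 for EWR
dt <- data.table(
  origin  = c("JFK", "JFK", "LGA", "EWR", "EWR", "EWR"),
  carrier = c("AA", "DL", "AA", "UA", "UA", "DL")
)

# Inside j, `origin` is the single value of the current group,
# while `carrier` is the full vector of that group's rows.
by_len <- dt[, .(
  len_origin  = length(origin),   # length of the grouping variable
  len_carrier = length(carrier),  # length of a non-grouping column
  n           = .N                # built-in row count per group
), by = origin]

by_len
# len_origin is 1 in every group; len_carrier and n are 2, 1, 3
```

So length() of any column that is not in by should agree with .N, but .N is the safer choice because it does not depend on which column you happen to pick.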
There are two related open issues that readers might find useful.