
.N differs from length(variable) when variable is used in `by`

Tags: r, data.table

While working through the data.table vignette “Introduction to data.table”, I came across this example in section 2, “Aggregations”:

ans <- flights[, .(.N), by = .(origin)]
ans
#    origin     N
#    <char> <int>
# 1:    JFK 81483
# 2:    LGA 84433
# 3:    EWR 87400

Replacing .N with the length of another column, e.g. year, gives the same number of rows per group:

> flights[, .(length(year)), by = .(origin)]
   origin    V1
1:    JFK 81483
2:    LGA 84433
3:    EWR 87400

or

> flights[, .(length(carrier)), by = .(origin)]
   origin    V1
1:    JFK 81483
2:    LGA 84433
3:    EWR 87400

That was expected. But when I use length(origin), i.e. the same variable that is used as the grouping variable in by, a different calculation is performed and the result is 1:

> flights[, .(length(origin)), by = .(origin)]
   origin V1
1:    JFK  1
2:    LGA  1
3:    EWR  1

Is there an explanation for why this happens?

In a more complicated example this could pass unnoticed, so it seems safer to always use the built-in .N rather than computing counts with the length function.

Numerari asked Nov 16 '22 06:11


1 Answer

This happens because the variables used in the by (or keyby) argument, i.e. the ones you are grouping your results on, have already been reduced to a scalar (the value of the particular group) by the time they are accessed in the j argument.

There are two related open issues that readers might find useful:
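The behaviour can be reproduced on a tiny table. The sketch below is only an illustration: the table `dt` and the result column names are made up here and stand in for `flights`. It compares `length()` of the grouping column, `length()` of another column, `.N`, and `length(.I)` within each group.

```r
library(data.table)

# Hypothetical stand-in for `flights`: 2 rows for JFK, 1 for LGA, 3 for EWR
dt <- data.table(
  origin  = c("JFK", "JFK", "LGA", "EWR", "EWR", "EWR"),
  carrier = c("AA", "DL", "AA", "UA", "UA", "DL")
)

by_len <- dt[, .(
  len_origin  = length(origin),   # grouping column: length 1 inside j
  len_carrier = length(carrier),  # ordinary column: one element per row in the group
  n           = .N,               # built-in group row count
  len_I       = length(.I)        # .I holds the group's row indices, so this matches .N
), by = .(origin)]

by_len
# len_origin is 1 in every group, while len_carrier, n and len_I
# are 2, 1, 3 for JFK, LGA, EWR respectively
```

So `length()` of any non-grouping column agrees with `.N`, while the grouping column itself does not, which is why `.N` is the safer way to count rows per group.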

  • https://github.com/Rdatatable/data.table/issues/1427
  • https://github.com/Rdatatable/data.table/issues/4079

jangorecki answered Mar 24 '23 04:03