.N differs from length(variable) when variable is used in `by`

Question

While working through the data.table vignette “Introduction to data.table”, I came across this example in section 2, “Aggregations”:

ans <- flights[, .(.N), by = .(origin)]
ans
#    origin     N
#    <char> <int>
# 1:    JFK 81483
# 2:    LGA 84433
# 3:    EWR 87400

Replacing .N with the length of another column, e.g. year, gives the same number of rows per group:

> flights[, .(length(year)), by = .(origin)]
   origin    V1
1:    JFK 81483
2:    LGA 84433
3:    EWR 87400

or

> flights[, .(length(carrier)), by = .(origin)]
   origin    V1
1:    JFK 81483
2:    LGA 84433
3:    EWR 87400

That was expected. But when I use length(origin), i.e. the same variable that is used as the grouping variable in by, the result is different, namely 1 for every group:

> flights[, .(length(origin)), by = .(origin)]
   origin V1
1:    JFK  1
2:    LGA  1
3:    EWR  1

Is there an explanation for why this happens?

In a more complicated example this could pass unnoticed, so it seems safer to always use the built-in .N rather than computing counts with length().

jangorecki · Accepted Answer

This happens because the variables named in by (or keyby), i.e. the ones you are grouping by, have already been reduced to a scalar (the value identifying the current group) by the time you access them in j. So length(origin) is always 1, while non-grouping columns such as year or carrier still hold one element per row of the group.

There are two related open issues that readers might find useful:
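
A minimal sketch of this behaviour, using a small made-up table instead of the flights data (the table `dt` and the output column names are assumptions for illustration only). It also shows one possible way to get the full-length grouping column back, by indexing the original column with .I:

```r
library(data.table)

# Toy data standing in for `flights`; the values are made up for illustration.
dt <- data.table(
  origin  = c("JFK", "JFK", "JFK", "LGA", "LGA", "EWR"),
  carrier = c("AA", "B6", "DL", "AA", "UA", "UA")
)

# Inside j, the grouping column is a length-1 vector holding the group's value,
# while a non-grouping column holds one element per row of the group.
res <- dt[, .(n           = .N,
              len_origin  = length(origin),
              len_carrier = length(carrier)),
          by = origin]
print(res)

# If the grouping column is needed at full length, one option is to index the
# original column with .I, the row numbers of the current group in `dt`.
res2 <- dt[, .(len_origin_full = length(dt$origin[.I])), by = origin]
print(res2)
```

Here len_origin is 1 for every group, whereas n, len_carrier and len_origin_full all agree with the group sizes. For plain counting, .N remains the simplest choice.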

https://github.com/Rdatatable/data.table/issues/1427
https://github.com/Rdatatable/data.table/issues/4079

Tags: r, data.table

Numerari
