I'm working with a big data.table and using 'by' to 'group by' 3 variables.
My data.table is d
and has the key "ma" (10 digit integer but I have shortened it below).
But setting by="ma,year,month"
(which is to me the more intuitive group by statement) does not give what I want. For example, ma = 284 has 3 entries for Nov 2011, or ma= 672 has 2 entries for Dec 2011.
> d[,list(n=length(trx_num)),by=list(ma,year,month)]
ma year month n
1: 284 2011 12 3
2: 284 2012 1 1
3: 284 2011 11 5
4: 284 2011 11 1
5: 284 2011 11 2
---
5782971: 672 2012 7 1
5782972: 672 2011 12 1
5782973: 672 2012 2 1
5782974: 672 2011 12 1
5782975: 672 2012 1 1
Reversing the 'by' order, however, gives the desired result.
> d[,list(n=length(trx_num)),by=list(month,year,ma)]
month year ma n
1: 12 2011 284 3
2: 1 2012 284 1
3: 11 2011 284 8
4: 5 2012 543 7
5: 7 2012 543 3
---
1214686: 5 2012 672 28
1214687: 4 2012 672 13
1214688: 12 2011 672 11
1214689: 7 2012 672 9
1214690: 9 2012 672 11
What am I missing here? Thanks in advance.
EDIT:
str() of the data that gives the wrong result
> str(d)
Classes âdata.tableâ and 'data.frame': 14688135 obs. of 3 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ month: int 12 1 11 12 11 11 11 11 11 11 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "ma"
str() of the wrong result:
> str(d[,.N,by=list(ma,year,month)])
Classes âdata.tableâ and 'data.frame': 5782975 obs. of 4 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2012 2012 2012 2012 2012 ...
$ month: int 12 1 11 11 11 5 7 6 9 8 ...
$ N : int 3 1 5 1 2 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
And str() of right result:
> str(d[,.N,by=list(month,year,ma)])
Classes âdata.tableâ and 'data.frame': 1214690 obs. of 4 variables:
$ month: int 12 1 11 5 7 6 9 8 11 12 ...
$ year : int 2011 2012 2011 2012 2012 2012 2012 2012 2011 2011 ...
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ N : int 3 1 8 7 3 12 15 3 6 6 ...
- attr(*, ".internal.selfref")=<externalptr>
To wrap up following the comment trail, the ma
column was type numeric
and contained values which were precisely different but very close together, almost within machine tolerance but not quite. In other words, this situation :
x < y < z
(y-x) just less than machine tolerance so considered equal
(z-y) just less than machine tolerance so considered equal
(z-x) just over machine tolerance so considered not equal
When such a column is grouped alongside two other columns (i.e. by=
3 columns), the order of those 3 columns, if one of those columns has values like above, can change whether those values are considered equal (and in the same group) or not.
The solution is not to use type numeric
(double
is another name) for such data. Use integer
, or in this case where the integers were larger than 2^31 (giving rise to the coercion to double
and loss of accuracy, iiuc), character
instead. data.table
is fast at sorting integer
and character
. It's not as fast at sorting double
yet anyway.
We'll try and add a new warning
to data.table
:
FR#2469 Add new tolerance.warning
option to detect and issue warning
if any numeric
values are close but not quite within machine tolerance
I built a small test case that at one point in this dialog I thought exhibited the unexpected behavior, (but I was reading the wrong objects for comparison):
d <- structure(list(ma = c(284L, 284L, 284L, 284L, 284L, 284L, 284L,
284L, 284L, 284L, 284L, 284L, 672L, 672L, 672L, 672L, 672L),
year = c(2011L, 2011L, 2011L, 2012L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2011L, 2012L, 2011L,
2012L), month = c(12L, 12L, 12L, 1L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 7L, 12L, 2L, 12L, 1L), trx_num = c(4L,
9L, 8L, 4L, 4L, 6L, 3L, 8L, 2L, 2L, 8L, 9L, 8L, 6L, 10L,
6L, 10L)), .Names = c("ma", "year", "month", "trx_num"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x105afb0>, sorted = "ma")
To build it:
dat <- read.table(text=" ma year month n
284 2011 12 3
284 2012 1 1
284 2011 11 5
284 2011 11 1
284 2011 11 2
672 2012 7 1
672 2011 12 1
672 2012 2 1
672 2011 12 1
672 2012 1 1", header=TRUE)
require(data.table)
d <- data.table( data.frame(dat[rep(rownames(dat), times=dat$n), 1:3], trx_num=unlist(sapply(dat$n, sample, x=1:10)) ) )
setkey(d, ma)
d[,list(n=length(trx_num)),by=list(ma,year,month)]
d[,list(n=length(trx_num)),by=list(month,year,ma)]
At which point it becomes clear that BlueMagister's solution is correct:
d[,.N, by=list(month,year,ma)]
d[,.N, by=list(ma,year,month)] # same result modulo row order
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With