I'm working with a big data.table and using 'by' to 'group by' 3 variables.
My data.table is d and has the key "ma" (10 digit integer but I have shortened it below).
But setting by="ma,year,month" (which is to me the more intuitive group by statement) does not give what I want. For example, ma = 284 has 3 entries for Nov 2011, or ma= 672 has 2 entries for Dec 2011.
> d[,list(n=length(trx_num)),by=list(ma,year,month)]
ma year month n
1: 284 2011 12 3
2: 284 2012 1 1
3: 284 2011 11 5
4: 284 2011 11 1
5: 284 2011 11 2
---
5782971: 672 2012 7 1
5782972: 672 2011 12 1
5782973: 672 2012 2 1
5782974: 672 2011 12 1
5782975: 672 2012 1 1
Reversing the 'by' order, however, gives the desired result.
> d[,list(n=length(trx_num)),by=list(month,year,ma)]
month year ma n
1: 12 2011 284 3
2: 1 2012 284 1
3: 11 2011 284 8
4: 5 2012 543 7
5: 7 2012 543 3
---
1214686: 5 2012 672 28
1214687: 4 2012 672 13
1214688: 12 2011 672 11
1214689: 7 2012 672 9
1214690: 9 2012 672 11
What am I missing here? Thanks in advance.
EDIT:
str() of the data that gives the wrong result
> str(d)
Classes âdata.tableâ and 'data.frame': 14688135 obs. of 3 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ month: int 12 1 11 12 11 11 11 11 11 11 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "ma"
str() of the wrong result:
> str(d[,.N,by=list(ma,year,month)])
Classes âdata.tableâ and 'data.frame': 5782975 obs. of 4 variables:
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ year : int 2011 2012 2011 2011 2011 2012 2012 2012 2012 2012 ...
$ month: int 12 1 11 11 11 5 7 6 9 8 ...
$ N : int 3 1 5 1 2 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
And str() of right result:
> str(d[,.N,by=list(month,year,ma)])
Classes âdata.tableâ and 'data.frame': 1214690 obs. of 4 variables:
$ month: int 12 1 11 5 7 6 9 8 11 12 ...
$ year : int 2011 2012 2011 2012 2012 2012 2012 2012 2011 2011 ...
$ ma : num 3e+10 3e+10 3e+10 3e+10 3e+10 ...
$ N : int 3 1 8 7 3 12 15 3 6 6 ...
- attr(*, ".internal.selfref")=<externalptr>
To wrap up following the comment trail, the ma column was type numeric and contained values which were precisely different but very close together, almost within machine tolerance but not quite. In other words, this situation :
x < y < z
(y-x) just less than machine tolerance so considered equal
(z-y) just less than machine tolerance so considered equal
(z-x) just over machine tolerance so considered not equal
When such a column is grouped alongside two other columns (i.e. by= 3 columns), the order of those 3 columns, if one of those columns has values like above, can change whether those values are considered equal (and in the same group) or not.
The solution is not to use type numeric (double is another name) for such data. Use integer, or in this case where the integers were larger than 2^31 (giving rise to the coercion to double and loss of accuracy, iiuc), character instead. data.table is fast at sorting integer and character. It's not as fast at sorting double yet anyway.
We'll try and add a new warning to data.table :
FR#2469 Add new tolerance.warning option to detect and issue warning if any numeric values are close but not quite within machine tolerance
I built a small test case that at one point in this dialog I thought exhibited the unexpected behavior, (but I was reading the wrong objects for comparison):
d <- structure(list(ma = c(284L, 284L, 284L, 284L, 284L, 284L, 284L,
284L, 284L, 284L, 284L, 284L, 672L, 672L, 672L, 672L, 672L),
year = c(2011L, 2011L, 2011L, 2012L, 2011L, 2011L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2011L, 2012L, 2011L,
2012L), month = c(12L, 12L, 12L, 1L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 7L, 12L, 2L, 12L, 1L), trx_num = c(4L,
9L, 8L, 4L, 4L, 6L, 3L, 8L, 2L, 2L, 8L, 9L, 8L, 6L, 10L,
6L, 10L)), .Names = c("ma", "year", "month", "trx_num"), row.names = c(NA,
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x105afb0>, sorted = "ma")
To build it:
dat <- read.table(text=" ma year month n
284 2011 12 3
284 2012 1 1
284 2011 11 5
284 2011 11 1
284 2011 11 2
672 2012 7 1
672 2011 12 1
672 2012 2 1
672 2011 12 1
672 2012 1 1", header=TRUE)
require(data.table)
d <- data.table( data.frame(dat[rep(rownames(dat), times=dat$n), 1:3], trx_num=unlist(sapply(dat$n, sample, x=1:10)) ) )
setkey(d, ma)
d[,list(n=length(trx_num)),by=list(ma,year,month)]
d[,list(n=length(trx_num)),by=list(month,year,ma)]
At which point it becomes clear that BlueMagister's solution is correct:
d[,.N, by=list(month,year,ma)]
d[,.N, by=list(ma,year,month)] # same result modulo row order
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With