Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"by" in data.table (group by) - what am I missing?

Tags:

r

data.table

I'm working with a big data.table and using 'by' to 'group by' 3 variables.

My data.table is d and has the key "ma" (10 digit integer but I have shortened it below).

But setting by="ma,year,month" (which is to me the more intuitive group by statement) does not give what I want. For example, ma = 284 has 3 entries for Nov 2011, or ma= 672 has 2 entries for Dec 2011.

> d[,list(n=length(trx_num)),by=list(ma,year,month)]
      ma year month n
  1: 284 2011    12 3
  2: 284 2012     1 1
  3: 284 2011    11 5
  4: 284 2011    11 1
  5: 284 2011    11 2
 ---
5782971: 672 2012     7 1
5782972: 672 2011    12 1
5782973: 672 2012     2 1
5782974: 672 2011    12 1
5782975: 672 2012     1 1

Reversing the 'by' order, however, gives the desired result.

> d[,list(n=length(trx_num)),by=list(month,year,ma)]
     month year ma  n
  1:    12 2011 284  3
  2:     1 2012 284  1
  3:    11 2011 284  8
  4:     5 2012 543  7
  5:     7 2012 543  3
 ---
1214686:     5 2012 672 28
1214687:     4 2012 672 13
1214688:    12 2011 672 11
1214689:     7 2012 672  9
1214690:     9 2012 672 11

What am I missing here? Thanks in advance.

EDIT:

str() of the data that gives the wrong result

> str(d)
Classes âdata.tableâ and 'data.frame':  14688135 obs. of  3 variables:
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ year : int  2011 2012 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ month: int  12 1 11 12 11 11 11 11 11 11 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "ma"

str() of the wrong result:

> str(d[,.N,by=list(ma,year,month)])
Classes âdata.tableâ and 'data.frame':  5782975 obs. of  4 variables:
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ year : int  2011 2012 2011 2011 2011 2012 2012 2012 2012 2012 ...
 $ month: int  12 1 11 11 11 5 7 6 9 8 ...
 $ N    : int  3 1 5 1 2 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

And str() of right result:

> str(d[,.N,by=list(month,year,ma)])
Classes âdata.tableâ and 'data.frame':  1214690 obs. of  4 variables:
 $ month: int  12 1 11 5 7 6 9 8 11 12 ...
 $ year : int  2011 2012 2011 2012 2012 2012 2012 2012 2011 2011 ...
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ N    : int  3 1 8 7 3 12 15 3 6 6 ...
 - attr(*, ".internal.selfref")=<externalptr>
like image 583
hhh Avatar asked Jan 10 '13 20:01

hhh


2 Answers

To wrap up following the comment trail, the ma column was type numeric and contained values which were precisely different but very close together, almost within machine tolerance but not quite. In other words, this situation :

 x < y < z
 (y-x) just less than machine tolerance so considered equal
 (z-y) just less than machine tolerance so considered equal
 (z-x) just over machine tolerance so considered not equal

When such a column is grouped alongside two other columns (i.e. by= 3 columns), the order of those 3 columns, if one of those columns has values like above, can change whether those values are considered equal (and in the same group) or not.

The solution is not to use type numeric (double is another name) for such data. Use integer, or in this case where the integers were larger than 2^31 (giving rise to the coercion to double and loss of accuracy, iiuc), character instead. data.table is fast at sorting integer and character. It's not as fast at sorting double yet anyway.

We'll try and add a new warning to data.table :

FR#2469 Add new tolerance.warning option to detect and issue warning if any numeric values are close but not quite within machine tolerance

like image 128
Matt Dowle Avatar answered Nov 11 '22 16:11

Matt Dowle


I built a small test case that at one point in this dialog I thought exhibited the unexpected behavior, (but I was reading the wrong objects for comparison):

d <-     structure(list(ma = c(284L, 284L, 284L, 284L, 284L, 284L, 284L, 
284L, 284L, 284L, 284L, 284L, 672L, 672L, 672L, 672L, 672L), 
    year = c(2011L, 2011L, 2011L, 2012L, 2011L, 2011L, 2011L, 
    2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2011L, 2012L, 2011L, 
    2012L), month = c(12L, 12L, 12L, 1L, 11L, 11L, 11L, 11L, 
    11L, 11L, 11L, 11L, 7L, 12L, 2L, 12L, 1L), trx_num = c(4L, 
    9L, 8L, 4L, 4L, 6L, 3L, 8L, 2L, 2L, 8L, 9L, 8L, 6L, 10L, 
    6L, 10L)), .Names = c("ma", "year", "month", "trx_num"), row.names = c(NA, 
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x105afb0>, sorted = "ma")

To build it:

dat <- read.table(text=" ma year month n
 284 2011    12 3
 284 2012     1 1
 284 2011    11 5
 284 2011    11 1
 284 2011    11 2
 672 2012     7 1
 672 2011    12 1
 672 2012     2 1
 672 2011    12 1
 672 2012     1 1", header=TRUE)
require(data.table)
d <- data.table( data.frame(dat[rep(rownames(dat), times=dat$n), 1:3], trx_num=unlist(sapply(dat$n, sample, x=1:10)) ) )
setkey(d, ma)
d[,list(n=length(trx_num)),by=list(ma,year,month)]
d[,list(n=length(trx_num)),by=list(month,year,ma)]

At which point it becomes clear that BlueMagister's solution is correct:

d[,.N, by=list(month,year,ma)]
d[,.N, by=list(ma,year,month)] # same result modulo row order
like image 20
IRTFM Avatar answered Nov 11 '22 17:11

IRTFM