Operator == inconsistent in logical columns in data.table

Tags: r, data.table

Please see the following reproducible example:

library(data.table)
set.seed(123)
DT <- data.table(A=rep(0.3,10000))
DT[, B := runif(.N) < A]
DT[B == T, .N]
# [1] 3005
DT[, summary(B)]
#    Mode   FALSE    TRUE    NA's
# logical    6995    3005       0

Everything looks fine: both methods report the same number of TRUE values. Now replace column B with a new one.

DT[, B := runif(.N) < A]
DT[B == T, .N]
# [1] 3331
DT[, summary(B)]
#    Mode   FALSE    TRUE    NA's
# logical    6981    3019       0 

The count of TRUE values in column B is now different! It is the same column, yet B == T reports 3331 TRUE values while summary(B) reports 3019.

When == is bypassed with != F:

DT[B != F, .N]
# [1] 3019
DT[, summary(B)]
#    Mode   FALSE    TRUE    NA's
# logical    6981    3019       0 

This count matches summary(B) again.
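For comparison, here is a minimal sketch of a few other ways to count the TRUE values without writing B == T in i. It assumes the same DT as above; the variable names n_paren, n_sum and n_neq are only illustrative and are not part of the original report.

```r
library(data.table)

set.seed(123)
DT <- data.table(A = rep(0.3, 10000))
DT[, B := runif(.N) < A]
DT[, B := runif(.N) < A]      # replace B, as in the question

# Three counts that do not use `==` in i:
n_paren <- DT[(B), .N]        # logical column used directly as the row filter
n_sum   <- DT[, sum(B)]       # each TRUE counts as 1
n_neq   <- DT[B != FALSE, .N] # the != form from the question

# All three should agree with each other and with summary(B).
c(n_paren, n_sum, n_neq)
```

If DT[B == TRUE, .N] disagrees with these, the == path is the one returning the wrong count.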

I can reproduce it with data.table v1.9.4 and v1.9.5 on Windows 8.1 x64.


Here's a much easier reproducible example without runif().

require(data.table) ## 1.9.4+
DT = data.table(x = 1:5)
DT[, y := x <= 2L]
#    x     y
# 1: 1  TRUE
# 2: 2  TRUE
# 3: 3 FALSE
# 4: 4 FALSE
# 5: 5 FALSE

DT[y == TRUE, .N]
# [1] 2             <~~~~~~ correct result.

DT[, y := x <= 3L]
#    x     y
# 1: 1  TRUE
# 2: 2  TRUE
# 3: 3  TRUE
# 4: 4 FALSE
# 5: 5 FALSE

DT[y == TRUE, .N]
# [1] 2             <~~~~~~ incorrect result, should be 3!
asked Oct 10 '14 21:10 by ChristK


2 Answers

Now fixed in v1.9.5 on GitHub.

:= and set* now drop secondary keys (new in v1.9.4) so that DT[x==y] works again after a := or set* without needing options(datatable.auto.index=FALSE). Only setkey() was dropping secondary keys correctly. 23 tests added. Thanks to user36312 for reporting, #885.
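For anyone still on an affected version, the release note above names options(datatable.auto.index=FALSE) as the way to avoid the problem. A minimal sketch of that workaround, using the smaller example from the question, might look like this (the variable names old and n are only illustrative):

```r
library(data.table)

# Turn off automatic secondary indexing before filtering with `==`.
# The option name is taken from the release note above.
old <- options(datatable.auto.index = FALSE)

DT <- data.table(x = 1:5)
DT[, y := x <= 2L]
DT[y == TRUE, .N]        # first filter on y

DT[, y := x <= 3L]       # replace y by reference
n <- DT[y == TRUE, .N]   # rows 1, 2 and 3 are TRUE, so this should be 3
n

options(old)             # restore the previous setting
```

With the option set before the first == filter, no automatic index should be created on y, so the second filter has nothing stale to reuse. On a version that includes the fix, the option should not be needed.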

answered Sep 21 '22 05:09 by Matt Dowle


Have a look at what @nrussell suggested. According to @Eddi, this could also be a bug. The cases below could serve as a temporary workaround, which @Arun also suggested; please refer to the exchange of comments.

Case 1

> set.seed(123)
> DT <- data.table(A=rep(0.3,10000))
> DT[, B := runif(.N) < A]
> DT[B == T, .N]
[1] 3012
> DT[, summary(B)]
   Mode   FALSE    TRUE    NA's 
logical    6988    3012       0 

Case 2

> set.seed(123)
> DT[, B := runif(.N) < A]
> DT[B == T, .N]
[1] 3012
> DT[, summary(B)]
   Mode   FALSE    TRUE    NA's 
logical    6988    3012       0 

Case 3

> set.seed(123)
> DT[, B := runif(.N) < A]
> DT[B != F, .N]
[1] 3012
> DT[, summary(B)]
   Mode   FALSE    TRUE    NA's 
logical    6988    3012       0 
answered Sep 21 '22 05:09 by KFB