I recently upgraded data.table from 1.8.10 to 1.9.2, and I found the following difference between the two versions when grouping across large integers.
Is there a setting I need to change in 1.9.2 so that the first of the following two group statements works as it did in 1.8.10 (and I presume 1.8.10 is the correct behavior)?
Also, the results for the second of the two group statements are the same in both versions, but is that behavior expected?
1.8.10
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> foo = data.table(i = c(2884199399609098249, 2884199399608934409))
> lapply(foo, class)
$i
[1] "numeric"
> foo
i
1: 2884199399609098240
2: 2884199399608934400
> foo[, .N, by=i]
i N
1: 2884199399609098240 1
2: 2884199399608934400 1
> foo = data.table(i = c(9999999999999999999, 9999999999999999998))
> foo[, .N, by=i]
i N
1: 10000000000000000000 2
>
And 1.9.2
> library(data.table)
data.table 1.9.2 For help type: help("data.table")
> foo = data.table(i = c(2884199399609098249, 2884199399608934409))
> lapply(foo, class)
$i
[1] "numeric"
> foo
i
1: 2884199399609098240
2: 2884199399608934400
> foo[, .N, by=i]
i N
1: 2884199399609098240 2
> foo = data.table(i = c(9999999999999999999, 9999999999999999998))
> foo[, .N, by=i]
i N
1: 10000000000000000000 2
>
The numbers used for the first test (which shows the difference between the data.table versions) come from my actual dataset, and they are the ones that caused a few of my regression tests to fail after upgrading data.table.
As for the second test, where I increase the numbers by another order of magnitude: is it expected, in both versions of the package, that minor differences in the last significant digit are ignored?
I'm assuming this all has to do with floating-point representation. Maybe the correct way for me to handle this is to represent these large integers as either integer64 or character? I'm hesitant to use integer64, as I'm not sure data.table and the R environment fully support it; e.g., I've had to add this in previous data.table code:
options(datatable.integer64="character") # Until integer64 setkey is implemented
Maybe that has since been implemented, but regardless, changing that setting does not change the results of these tests, at least in my environment. I suppose that makes sense, given that these values are stored as numeric in the foo data.table.
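For reference, here is a minimal sketch (my illustration, not part of the original tests) of the representation issue: doubles carry a 53-bit significand, so integers beyond 2^53 are not all exactly representable, which is why the ids above print with a trailing 0.
x = 2^53
x == x + 1                             # TRUE: the + 1 is lost in double precision
sprintf("%.0f", 2884199399609098249)   # "2884199399609098240", not ...249
library(bit64)                         # integer64 holds the full value exactly
as.integer64("2884199399609098249")    # 2884199399609098249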
Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here:
Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge); see NEWS:
o   bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs and Clayton Stanley. Reminder: fread() has been able to detect and read integer64 for a while.

o   New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type numeric, #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.
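As an aside on that fread() reminder, a quick sketch (my example, assuming bit64 is installed and the default datatable.integer64 setting; tmp is just a throwaway file):
library(data.table)   # v1.9.3+, per the NEWS above
tmp = tempfile()
writeLines(c("id", "2884199399609098249", "2884199399608934409"), tmp)
DT = fread(tmp)       # values too big for integer: detected as integer64
class(DT$id)          # "integer64", full precision preserved
DT[, .N, by=id]       # should give 2 groups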
So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or better, use the more appropriate type for the column: bit64::integer64, now that it's supported.
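Applied to the example in the question, both remedies would look something like this (a sketch assuming v1.9.3 as above):
library(data.table)
library(bit64)

setNumericRounding(0)  # remedy 1: switch off significand rounding globally
foo = data.table(i = c(2884199399609098249, 2884199399608934409))
foo[, .N, by=i]        # 2 groups again, matching v1.8.10
setNumericRounding(2)  # restore the default

# remedy 2 (preferred): build the ids as integer64 from character,
# so no precision is lost when the literals are parsed
foo = data.table(i = as.integer64(c("2884199399609098249",
                                    "2884199399608934409")))
foo[, .N, by=i]        # 2 groups, exact values throughout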
The change in v1.9.2 was:
o Numeric data is still joined and grouped within tolerance as before but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
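To see why 2-byte rounding merges the two ids from the question, compare their relative difference to the apx 11 significant figures that survive rounding (my arithmetic, for illustration):
x = c(2884199399609098249, 2884199399608934409)
diff(x)             # -163840: the stored doubles do differ...
abs(diff(x)/x[1])   # ...but only by apx 5.7e-14, i.e. they agree to ~13 s.f.,
                    # beyond the ~11 s.f. kept after 2-byte rounding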
The example in ?setNumericRounding is:
> DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
> DT
a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0) # turn off rounding; i.e. if we didn't round
> DT[.(0.4)] # works
a b
1: 0.4 1
> DT[.(0.6)] # no match! confusing to users
a b # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>
> setNumericRounding(2) # restore default
> DT[.(0.6)] # now works as user expects
a b
1: 0.6 2
>
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id] # 1 row (the last digit has been rounded)
id N
1: 1.234568e+12 3
> setNumericRounding(0) # turn off rounding
> DT[,.N,by=id] # 3 rows (the last digit wasn't rounded)
id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
> # but, better to use bit64::integer64 for such ids instead of numeric
> setNumericRounding(2) # restore default, preferred
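One caveat to add (my note, not from the original answer): integer64 is a signed 64-bit type, so it tops out at 2^63 - 1, approximately 9.22e18. The values in the second test (~1e19) exceed even that, so for ids of that size, character is the only exact representation:
library(bit64)
library(data.table)
lim = as.integer64("9223372036854775807")  # 2^63 - 1, the integer64 maximum
lim + 1              # NA with an overflow warning
foo = data.table(i = c("9999999999999999999", "9999999999999999998"))
foo[, .N, by=i]      # 2 groups: character comparison is exact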