I have a very interesting problem, though I'd rather not have one. I have to round a number to the closest multiple, so I followed the solution here. It used to work OK, until I discovered a bug with data.table:
library(data.table)
options(digits = 20) # to see number representation
mround <- function(number, multiple) {
  return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
DT[a == 112.3,] # works as expected, i.e. returns one row
DT[a == 112.3 & b == 'B', ] # doesn't work
To be fair, with data.frame even the first filter doesn't work. For illustration (DF is just the data.frame analogue of DT):
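DF = data.frame(a = mround(112.3, 0.1), b = "B")
DF[DF$a == 112.3, ] # zero rows come back
# [1] a b
# <0 rows> (or 0-length row.names)
Any ideas how to fix that?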
Just to add to @Tens' great answer.
What seems to be happening here is a combination of three things: floating point representation, data.table's numeric rounding, and automatic secondary indices.
Using your setup:
library(data.table)
options(digits = 20) # to see number representation
mround <- function(number, multiple) {
  return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
So let's address the points above. Since you have a floating point number, quoting ?setNumericRounding:
Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double'.
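You can see this directly with the values from the question:
mround(112.3, 0.1) == 112.3 # the stored value is 112.30000000000001..., not exactly 112.3
# [1] FALSE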
This led the data.table devs to implement setNumericRounding, which auto-rounded floats so that the radix sorting algorithm would behave as expected.
Prior to v1.9.8, setNumericRounding(2) was the default (hence your first example works), but after some complaints from users about inconsistency on GH (IIRC), since v1.9.8 the default was set back to setNumericRounding(0) in order to be consistent with data.frame behavior (see here). So if you update your data.table to the latest version, you will see that both data.table and data.frame behave the same for both of your examples (and both of your examples will fail).
Compare
setNumericRounding(0)
DT[a == 112.3]
## Empty data.table (0 rows) of 2 cols: a,b
To
setNumericRounding(1)
DT[a == 112.3]
# a b
# 1: 112.30000000000001 B
So you will ask, "what on earth does the radix algorithm have to do with anything here?" Here we reach the third point above: secondary indices (please read this). Let's see what actually happens when you run your code above:
options(datatable.verbose = TRUE)
DT[a == 112.3] # works as expected, i.e returns one row
# Creating new index 'a' <~~~~
# forder took 0 sec
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.30000000000001 B
Let's check the new secondary index:
indices(DT)
#[1] "a"
When you ran ==, data.table set a as your secondary index in order to perform future operations much more efficiently (this was introduced in v1.9.4, see here). In other words, you performed a binary join on a instead of the usual vector scan used prior to v1.9.4. (As a side note, this can be disabled with options(datatable.auto.index = FALSE); in that case, none of your examples will work even with setNumericRounding(1), unless you explicitly specify a key using setkey or the on argument.)
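To illustrate that side note, here's a rough sketch (rebuilding DT so that no index exists yet, and restoring the option afterwards):
options(datatable.auto.index = FALSE)
DT = data.table(a = mround(112.3, 0.1), b = "B") # fresh table, no index
setNumericRounding(1)
DT[a == 112.3] # plain vector scan now, so the rounding doesn't apply
## Empty data.table (0 rows) of 2 cols: a,b
DT[.(112.3), on = .(a)] # explicit binary join via 'on', rounding applies
# a b
# 1: 112.30000000000001 B
options(datatable.auto.index = TRUE) # restore the default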
This probably also explains why
DT[a == 112.3 & b == 'B']
doesn't work. You are subsetting here by two columns, and neither secondary indices nor binary joins (automatically) kick in for expressions such as == & == (yet); hence you performed a normal vector scan and setNumericRounding(1) didn't kick in.
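You can verify this with the question's setup (as of the data.table version discussed here):
setNumericRounding(1)
DT[a == 112.3 & b == 'B'] # vector scan, so the rounding doesn't help
## Empty data.table (0 rows) of 2 cols: a,b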
You can, though, set the keys manually and make it work. For instance (as I commented under @Tens' answer), you can do:
setNumericRounding(1) # make sure auto rounding is turned on
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
Or using the old way:
setkey(DT, a, b)
DT[.(112.3, 'B'), nomatch = 0L]
# Starting bmerge ...done in 0 secs
# a b
# 1: 112.3 B
It's a problem of floating point precision.
See DT[abs(a - 112.3) < 1.e-6 & b == 'B', ]; using an error margin of 0.000001 will give you the proper result.
If you want more precision you can use .Machine$double.eps^0.5, as all.equal does.
The general advice is to never compare floats for equality, but to compare the difference against a value near enough to machine precision to get around the precision drift (between 0 and 1); more details here.
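For example, with the question's value:
x <- mround(112.3, 0.1)
x == 112.3 # exact comparison fails
# [1] FALSE
abs(x - 112.3) < .Machine$double.eps^0.5 # tolerance-based comparison succeeds
# [1] TRUE
isTRUE(all.equal(x, 112.3))
# [1] TRUE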
One way to fix your problem could be to refactor your function to:
mround <- function(number, multiple,
                   digits = nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])) {
  round(multiple * round(number / multiple), digits)
}
I used a "convoluted" method to get the digits needed from the multiple passed as default significant digits, adapt to your needs (you may used 2 here for example, or force the precision when calling).
I removed the unnecessary return
which just cause the interpreter to look for a function already called at end of the function call.
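With this version, both of the original filters work; a quick check (assuming the same DT setup as in the question):
mround(112.3, 0.1) == 112.3
# [1] TRUE
DT = data.table(a = mround(112.3, 0.1), b = "B")
nrow(DT[a == 112.3 & b == 'B'])
# [1] 1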
This way your output should be precise enough, but you'll still have corner cases:
> mround(112.37,0.2)
[1] 112.40000000000001
To use floats in joins, you can use (courtesy of David Arenburg):
setNumericRounding(1)
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]