 

Round to a multiple and filter in data.table

Tags: r, data.table

I have a very interesting problem, though I'd rather not have one. I have to round a number to the closest multiple, so I followed the solution here. It used to work OK, until I discovered this bug with data.table:

library(data.table)
options(digits = 20) # to see number representation
mround <- function (number, multiple) {
   return(multiple * round(number / multiple))
}
DT = data.table(a = mround(112.3, 0.1), b = "B")
DT[a == 112.3,] # works as expected, i.e. returns one row
DT[a == 112.3 & b == 'B', ] # doesn't work

To be fair, with data.frame even the first filter doesn't work (see the snippet below). Any ideas on how to fix that?
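For reference, here is a minimal data.frame version of the same check (reusing the mround() above):

df <- data.frame(a = mround(112.3, 0.1), b = "B")
df[df$a == 112.3, ]   # zero rows: the stored value is not exactly 112.3
# [1] a b
# <0 rows> (or 0-length row.names)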

asked Feb 05 '23 by kismsu

2 Answers

Just to add to @Tens' great answer.

There seem to be three things happening here:

  • You have a floating point issue (as mentioned already)
  • You are using an old data.table version
  • Secondary indices are kicking in without you being aware of it

Using your setup

library(data.table)
options(digits = 20) # to see number representation

mround <- function (number, multiple) {
  return(multiple * round(number / multiple))
}

DT = data.table(a = mround(112.3, 0.1), b = "B")

So let's address the points above. Since you have a floating point issue, quoting ?setNumericRounding:

Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double'.
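You can see the mismatch directly with the values involved here (a quick check reusing the mround() defined above; the exact printed digits may vary slightly):

mround(112.3, 0.1) == 112.3   # the stored value and the literal are two different doubles
# [1] FALSE
mround(112.3, 0.1) - 112.3    # roughly 1.4e-14, about one ulp at this magnitude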

This led the data.table devs to implement setNumericRounding, which automatically rounded floats so that the radix sorting algorithm would behave as expected.

Prior to v1.9.8, setNumericRounding(2) was the default (hence your first example works), but after some complaints from users about inconsistency on GH (IIRC), since v1.9.8 the default has been setNumericRounding(0), in order to be consistent with data.frame behaviour (see here). So if you update your data.table to the latest version, you will see that data.table and data.frame behave the same for both of your examples (and both of your examples will fail).
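If you are not sure which behaviour your installation uses, you can check it directly (getNumericRounding() is documented alongside setNumericRounding()):

packageVersion("data.table")  # 1.9.8 or later means rounding is off by default
getNumericRounding()          # 0 = no rounding (current default); 2 was the old default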

Compare

setNumericRounding(0)
DT[a == 112.3]
## Empty data.table (0 rows) of 2 cols: a,b

To

setNumericRounding(1)
DT[a == 112.3]
#                     a b
# 1: 112.30000000000001 B

So you may ask, "what on earth does the radix algorithm have to do with anything here?" Here we reach the third point above: secondary indices (please read this). Let's see what actually happens when you run your code above

options(datatable.verbose = TRUE)
DT[a == 112.3] # works as expected, i.e. returns one row
# Creating new index 'a' <~~~~
# forder took 0 sec
# Starting bmerge ...done in 0 secs
#                     a b
# 1: 112.30000000000001 B

Let's check the new secondary indices

indices(DT)
#[1] "a"

When you ran ==, data.table set a as your secondary index in order to perform future operations much more efficiently (this was introduced in v1.9.4, see here). In other words, you performed a binary join on a instead of the usual vector scan, as it was prior to v1.9.4. (As a side note, this can be disabled by doing options(datatable.auto.index = FALSE); in that case, none of your examples will work even with setNumericRounding(1), unless you explicitly specify a key using setkey or the on argument.)
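To illustrate that side note, here is a small sketch on a fresh table so that no index exists yet (DT2 is just a throwaway name; verbose output omitted):

options(datatable.auto.index = FALSE)  # disable automatic secondary indices
setNumericRounding(1)
DT2 <- data.table(a = mround(112.3, 0.1), b = "B")  # fresh table, no index built
DT2[a == 112.3]                        # plain vector scan now, so the rounding doesn't help
# Empty data.table (0 rows) of 2 cols: a,b
options(datatable.auto.index = TRUE)   # restore the default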

This probably also explains why

DT[a == 112.30000 & b == 'B'] 

doesn't work. You are subsetting here by two columns, and neither secondary indices nor binary joins (automatically) kick in for an expression such as == & == (yet). Hence you did a normal vector scan and setNumericRounding(1) didn't kick in.

Though you can set the keys manually and make it work. For instance (as I commented under @Tens' answer), you can do

setNumericRounding(1) # make sure auto-rounding is turned on
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
#        a b
# 1: 112.3 B

Or using the old way

setkey(DT, a, b)
DT[.(112.3, 'B'), nomatch = 0L]
# Starting bmerge ...done in 0 secs
#        a b
# 1: 112.3 B
answered Feb 19 '23 by David Arenburg


It's a problem of floating point precision. See DT[abs(a - 112.3) < 1.e-6 & b == 'B', ]; using an error margin of 0.000001 will give you the proper result.

If you want more precision you can use .Machine$double.eps^0.5, as all.equal does.

The general advice is to never compare floats for equality, but to compare the difference against a value close enough to the machine precision to get around the precision drift (more details here).
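For example, a sketch of that tolerance-based filter, using the same default tolerance as all.equal():

tol <- .Machine$double.eps^0.5        # ~1.5e-8, the default tolerance used by all.equal()
DT[abs(a - 112.3) < tol & b == 'B']   # returns the single matching row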

One way to fix your problem could be to refactor your function to:

mround <- function(number, multiple,
                   digits = nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])) {
  round(multiple * round(number / multiple), digits)
}

I used a "convoluted" method to get the number of digits from the multiple and pass it as the default for digits; adapt it to your needs (you may use 2 here for example, or force the precision when calling).
I also removed the unnecessary return(), which just adds an extra function call; R already returns the value of the last expression in the function body.
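To make that default explicit, here is how the digits argument evaluates for the multiple used in the question:

strsplit(as.character(0.1), ".", fixed = TRUE)[[1]]
# [1] "0" "1"
nchar(strsplit(as.character(0.1), ".", fixed = TRUE)[[1]][2])
# [1] 1   -> mround(112.3, 0.1) is rounded to 1 decimal place, so a == 112.3 should now match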

This way your output should be precise enough, but you'll still have corner cases:

> mround(112.37,0.2)
[1] 112.40000000000001

To use floats in joins, you can use (courtesy of David Arenburg):

setNumericRounding(1)
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]
answered Feb 18 '23 by Tensibai