
R's data.table Truncating Bits?

Tags: r, data.table

So I'm a huge data.table fan in R. I use it almost all the time, but I have come across a situation in which it won't work for me at all. I have a package (internal to my company) that uses R's double type to store the value of an unsigned 64-bit integer whose bit sequence corresponds to some fancy encoding. This package works very nicely everywhere except with data.table: if I aggregate on a column of this data, I lose a large number of my unique values. My only guess is that data.table is truncating bits in some kind of weird double optimization.
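
To illustrate what I mean (just a sketch, not my package's actual encoding): the 8 bytes behind each double are the payload, so any change to those bytes changes the encoded 64-bit value.

# Illustrative only: reinterpret the 8 bytes of a double as raw data.
x <- 6.95476896592629e-309
writeBin(x, raw())   # the 8 raw bytes that carry the 64-bit payload
# Rounding away the low-order bytes changes the encoded value, even though
# the printed double looks almost identical.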

Can anyone confirm that this is the case? Is this simply a bug?

Below is a reproduction of the issue and a comparison with the package I currently must use but want to avoid with a passion (dplyr).

library(data.table)
library(dplyr)

temp <- structure(list(obscure_math = c(6.95476896592629e-309, 6.95476863436446e-309, 
6.95476743245288e-309, 6.95476942182375e-309, 6.95477149408563e-309, 
6.95477132830476e-309, 6.95477132830476e-309, 6.95477149408562e-309, 
6.95477174275702e-309, 6.95476880014538e-309, 6.95476896592647e-309, 
6.95476896592647e-309, 6.95476900737172e-309, 6.95476900737172e-309, 
6.95476946326899e-309, 6.95476958760468e-309, 6.95476958760468e-309, 
6.95477020928318e-309, 6.95477124541406e-309, 6.95476859291965e-309, 
6.95476875870014e-309, 6.95476904881676e-309, 6.95476904881676e-309, 
6.95476904881676e-309, 6.95476909026199e-309, 6.95476909026199e-309, 
6.95476909026199e-309, 6.95476909026199e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.95477211576406e-309, 
6.95476880014538e-309, 6.95476880014538e-309, 6.95476880014538e-309, 
6.95476892448104e-309, 6.95476880014538e-309, 6.95476892448105e-309, 
6.9547689659263e-309, 6.95476913170719e-309, 6.95476933893334e-309
)), .Names = "obscure_math", class = c("data.table", "data.frame"), row.names = c(NA, 
-50L))

dt_collapsed <- temp[, .(count = .N), by = obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))    # FALSE: distinct values are collapsed

setDF(temp)
dplyr_collapsed <- temp %>% group_by(obscure_math) %>% summarise(count = n())
nrow(dplyr_collapsed) == length(unique(temp$obscure_math)) # TRUE: dplyr keeps every unique value
Asked Jun 04 '16 by stanekam


1 Answer

Update: the default rounding feature has been removed in the current development version of data.table (v1.9.7). See installation instructions for devel version here.
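
If you want to try the devel version and the link above isn't handy, one common route (an assumption on my part, not necessarily the official channel the answer refers to) is installing straight from the GitHub repository:

# Assumes the "remotes" package; the answer's link may describe another channel.
install.packages("remotes")
remotes::install_github("Rdatatable/data.table")
packageVersion("data.table")   # the devel version at the time was 1.9.7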

This also means that you are responsible for understanding the limitations of floating point representation and dealing with them yourself.
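
For a concrete illustration of those limitations (a standard base R example, nothing data.table specific):

0.1 + 0.2 == 0.3                # FALSE
print(0.1 + 0.2, digits = 17)   # 0.30000000000000004
all.equal(0.1 + 0.2, 0.3)       # TRUE, because all.equal compares within a tolerance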


data.table has been around for a long time. We used to deal with the limitations of floating point representation by using a threshold, like base R does (e.g., all.equal). However, that simply does not work, since the threshold needs to adapt to how big the compared numbers are. This series of articles is an excellent read on this topic and other potential issues.
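
To see why a single fixed threshold cannot work across magnitudes, here is an illustrative base R sketch (not data.table code):

.Machine$double.eps               # ~2.22e-16: spacing of representable doubles near 1
1e18 + 50 == 1e18                 # TRUE: 50 is below the spacing (128) of doubles near 1e18
abs((0.1 + 0.2) - 0.3) < 1e-8     # TRUE: the same absolute threshold is enormous near 0.3
# An absolute threshold tuned for values near 1 is meaningless near 1e18, and vice versa.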

This was a recurring issue, either because a) people did not realise the limitations, or because b) thresholding did not really solve their problem, and so people kept asking here or posting on the project page.

While reimplementing data.table's ordering as fast radix ordering, we took the opportunity to provide an alternative way of fixing the issue, along with a way out if it proves undesirable (exporting setNumericRounding). With issue #1642, ordering probably does not need rounding of doubles any more (but it is not that simple, since ordering directly affects binary search based subsets).

The actual problem here is grouping on floating point numbers, and it is even worse with numbers like the ones in your case. That is just a bad choice, IMHO.

I can think of two ways forward:

  1. When grouping on columns that are really doubles (in R, 1 is a double as opposed to 1L, and integer columns do not have this issue), we provide a warning that the last 2 bytes are rounded off and that people should read ?setNumericRounding, and we also suggest using bit64::integer64 (see the sketch after this list).

  2. Remove the functionality that allows grouping operations on really-double values, or force users to fix the precision to a certain number of digits before continuing. I can't think of a valid reason why one would really want to group by floating point numbers (would love to hear from people who do).
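
Here is a hedged sketch of the bit64::integer64 suggestion from point 1 (it assumes the upstream package could hand over the 64-bit payload as integer64 instead of packing it into a double):

library(bit64)
library(data.table)

# Values just past 2^53 cannot be told apart as doubles, but integer64 keeps them exact.
ids <- as.integer64(c("9007199254740993", "9007199254740994", "9007199254740993"))
dt  <- data.table(id = ids)
dt[, .(count = .N), by = id]   # groups exactly: two distinct ids, with counts 2 and 1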

What is very unlikely to happen is going back to threshold-based checks for identifying which doubles should belong to the same group.

Just so that the Q remains answered, use setNumericRounding(0L).
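
Applied to the reproduction from the question (a sketch: note that temp was converted to a data.frame with setDF there, so it is converted back first):

library(data.table)
setNumericRounding(0L)    # compare all 8 bytes of each double; no rounding

setDT(temp)               # convert back, since setDF(temp) was run in the question
dt_collapsed <- temp[, .(count = .N), by = obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))   # now TRUE

getNumericRounding()      # check the current setting
# setNumericRounding(2L)  # restore rounding of the last 2 bytes if you want it back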

Answered Nov 11 '22 by Arun