data.table roll="nearest" returns multiple results

Tags: r, data.table

I'm attempting to use data.table to match the nearest decimal value in a vector, but I'm running into a situation where more than one result is returned. The simplified example below returns two values, 0.1818182 and 0.2727273, while a less precise value for x (e.g. 0.0275) returns a single match (0.1818182).

library(data.table)

x = 0.0275016249293408
dt = data.table(rnk = c(0, 0.0909090909090909, 
                        0.181818181818182, 0.272727272727273),
                val = c(0.0233775088495975, 0.0270831481152598, 
                        0.0275016216267234, 0.0275016249293408),
                key="val")
dt[J(x), roll="nearest"][, ifelse(is.na(val), NA_real_, rnk)]

I'm assuming the problem is related to the precision of the numeric values I'm using for this comparison. Is there a limitation to the decimal precision that can be used for a nearest match (i.e. do I need to round the data points)? Is there a better way to accomplish this nearest match?

Asked Feb 19 '14 by user338714

2 Answers

Yes, data.table automatically applies a tolerance when joining and grouping numeric columns. The tolerance in v1.8.10 is sqrt(.Machine$double.eps) == 1.490116e-08. This comes directly from ?base::all.equal.
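You can check that constant directly in R:

sqrt(.Machine$double.eps)
# [1] 1.490116e-08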

To illustrate, consider grouping:

> dt
          rnk        val
1: 0.00000000 0.02337751
2: 0.09090909 0.02708315
3: 0.18181818 0.02750162
4: 0.27272727 0.02750162

> dt[,.N,by=val]
          val N
1: 0.02337751 1
2: 0.02708315 1
3: 0.02750162 2    # one group, size two
>

When you joined using dt[J(x), roll="nearest"], that x value matched to within tolerance, so you got the whole group it matched to, as usual when a matching value occurs in a rolling join. roll="nearest" only applies to values that don't match, i.e. those outside tolerance.
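To confirm, the two close val values in the example differ by only about 3.3e-09, well inside that tolerance:

abs(0.0275016249293408 - 0.0275016216267234) < sqrt(.Machine$double.eps)
# [1] TRUE   (the difference is ~3.3e-09, below the 1.49e-08 tolerance)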

data.table considers the values in rows 3 and 4 of val to be equal. The thinking behind this is convenience, since most of the time key values really are of fixed precision, such as prices ($1.23) or measurements recorded to a specified precision (1.234567). We'd like to join and group such numerics even after multiplying them, for example, without needing to code for machine accuracy ourselves. And we'd like to avoid confusion when numeric data displays as though it's equal in a table, but isn't, due to very tiny differences in the bit representation.
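A standard R illustration of that last point (not from the answer itself, just the usual floating point example):

0.1 + 0.2 == 0.3                                     # FALSE: the bit patterns differ
abs((0.1 + 0.2) - 0.3) < sqrt(.Machine$double.eps)   # TRUE: equal within tolerance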

See ?unique.data.table for this example:

DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
length(unique(DT$a))         # 10 strictly unique floating point values
all.equal(DT$a,rep(1,10))    # TRUE, all within tolerance of 1.0
DT[,which.min(a)]            # row 10, the strictly smallest floating point value
identical(unique(DT),DT[1])  # TRUE, stable within tolerance
identical(unique(DT),DT[10]) # FALSE

data.table is also stable within tolerance; i.e., when you group by a numeric column, the original order of the items within each group is maintained as usual.

> dt$val[3] < dt$val[4]   # in your example data, 3 is strictly less than 4
[1] TRUE
> dt[, row:=1:4]  # add a row number to illustrate
> dt[, list(.N, list(row)), by=val]
          val N  V2
1: 0.02337751 1   1
2: 0.02708315 1   2
3: 0.02750162 2 3,4
> dt[3:4, val:=rev(val)]   # swap the two values around
> dt$val[3] > dt$val[4]
[1] TRUE
> dt[, list(.N, list(row)), by=val]
          val N  V2
1: 0.02337751 1   1
2: 0.02708315 1   2
3: 0.02750162 2 3,4    # same result, consistent. stable within tolerance
Answered by Matt Dowle

Referring to Matt's answer, there is an easy way to use all 15 significant digits a double offers in order to select the closest matching row. Instead of working on the original values, scale them up so that their significant digits lie above the 10^-8 tolerance level. This could be done as follows:

orig_vals <- dt[, val]
# pick a power of ten that lifts the values' significant digits above
# the sqrt(.Machine$double.eps) ~ 1.49e-08 join tolerance
scale_fact <- max(10^(trunc(log10(abs(orig_vals))) + 8))
scaled_vals <- orig_vals * scale_fact
dt[, scaled_val := scaled_vals]   # add the scaled column and key on it
setkey(dt, scaled_val)
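As a sanity check on the example data above, the scale factor works out to 1e7, and the smallest gap between the scaled keys is about 0.033, many orders of magnitude above the 1.49e-08 tolerance:

scale_fact
# [1] 1e+07
min(diff(sort(scaled_vals)))
# [1] 0.03302617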

Now, performing the rolling join

scaled_x <- x * scale_fact   # scale the lookup value by the same factor
dt[J(scaled_x), roll="nearest"][, ifelse(is.na(val), NA_real_, rnk)]

# [1] 0.2727273

yields a single value, as desired.

If only one row should be selected even when two key values are identical, the mult="first" argument can be added to the above data.table call.
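For instance, a sketch of that variant using the same scaled key:

dt[J(scaled_x), roll="nearest", mult="first"][, ifelse(is.na(val), NA_real_, rnk)]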

Answered by Georg Schnabel