Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up vector-lookup for data.table

Tags:

r

data.table

Per row in a data.table, I need to find the closest lower number from a vector. The following minimum working example does the job, but is much too slow, especially for longer pre.numbers vectors (about 1 million elements in the real data).

library(data.table)
set.seed(2)
pre.numbers <- sort(floor(runif(50000, 1, 1000000)))
the.table <- data.table(cbind(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)), closest.lower.number=NA_integer_))
setkey(the.table, rowid)
the.table[, closest.lower.number:=max(pre.numbers[pre.numbers<current.number]), by=rowid]

There must be a smarter way to do it. There is no relationship between the vector-numbers and the numbers in the data.table.

like image 343
Chris Avatar asked Dec 01 '25 04:12

Chris


1 Answers

How about this? Using data.table's rolling joins:

DT = data.table(pre = pre.numbers, 
       current.number = pre.numbers+0.5, key="current.number")
setkey(the.table, current.number)
ans = DT[the.table, roll=Inf, rollends=FALSE]

Since you're dealing with integers, I've just added 0.5 (any number between 0 and 1 should be fine) to create DT from pre.numbers.

The last step performs a LOCF rolling join (last observation carried forward). For each value of current.number (key column), matching rows are looked for in DT's current.number (key column). If there's no match, the last observation is rolled forward. And if the match occurs at the start/end, it results in NA (rollends = FALSE).

To illustrate better as to what happens, consider this case:

# pre.numbers:
# c(10, 11)

# the.table:
# current.numbers
#               9
#              10
#              11

We convert pre.numbers to DT first, which'll result in the columns

# DT:
# pre  current.numbers (key col)
#  10             10.5
#  11             11.5

For each value in the.table:

#  9 -> falls before 10.5, LOCF and rollends = FALSE => result is NA
# 10 -> falls before 10.5 => same as above
# 11 -> falls between 10.5 and 11.5, LOCF matches with previous row = 10.5 
#       corresponding pre = 10.

HTH


Here's the code I used to generate the data:

require(data.table)
set.seed(1L)
pre.numbers = sort(floor(runif(50000, 1, 1000000)))
the.table = data.table(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)))
like image 169
Arun Avatar answered Dec 04 '25 20:12

Arun