I would like to conditionally join two data tables together:
library(data.table)
set.seed(1)
key.table <- 
  data.table(
    out = (0:10)/10,
    keyz = sort(runif(11))
  )
large.tbl <- 
  data.table(
    ab = rnorm(1e6),
    cd = runif(1e6)
  )
according to the following rule: for each row of large.tbl, take the smallest value of out in key.table whose keyz value is larger than cd. I have the following:
library(dplyr)
large.tbl %>%
  rowwise %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
which provides the correct output. The problem is that the rowwise operation is expensive for the large.tbl I am actually using, crashing R unless it is run on a particular computer. Are there less memory-expensive operations? The following seems slightly faster, but not by enough for the problem I have.
large.tbl %>%
    group_by(cd) %>%
    mutate(out = min(key.table$out[key.table$keyz > cd]))
This smells like a problem with a data.table answer, but the answer does not have to use that package.
If key.table$out is also sorted, as it is in your toy example, the following would work:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
large.tbl$out <- key.table$out[ind]
head(large.tbl)
#             ab         cd out
#1: -0.928567035 0.99473795  NA
#2: -0.294720447 0.41107393 0.5
#3: -0.005767173 0.91086585 1.0
#4:  2.404653389 0.66491244 0.8
#5:  0.763593461 0.09590456 0.1
#6: -0.799009249 0.50963409 0.5
If key.table$out is not sorted, take a running minimum from the right first:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
vec <- rev(cummin(rev(key.table$out)))
large.tbl$out <- vec[ind]
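A quick sanity check of the findInterval approach, comparing it against a naive per-row computation of the original rule on a small sample (small n, so the naive loop stays cheap):

```r
library(data.table)
set.seed(1)
key.table <- data.table(out = (0:10)/10, keyz = sort(runif(11)))
small.tbl <- data.table(cd = runif(100))

# vectorised: findInterval counts keyz values <= cd, so +1 indexes the
# smallest keyz strictly greater than cd (NA past the last keyz)
ind <- findInterval(small.tbl$cd, key.table$keyz) + 1
# running minimum from the right guards against an unsorted out column
vec <- rev(cummin(rev(key.table$out)))
fast <- vec[ind]

# naive per-row version of the rule; min() over an empty set is Inf,
# where findInterval gives NA, so align the two
slow <- suppressWarnings(
  sapply(small.tbl$cd, function(x) min(key.table$out[key.table$keyz > x]))
)
slow[is.infinite(slow)] <- NA

all.equal(fast, slow)  # TRUE
```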
What you want is:
setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]
See the roll argument in ?data.table:
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts), roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.
(to be fair I think this could go for some elucidation, it's pretty dense)
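Put together, a minimal sketch of the rolling join (column names follow the question; n is reduced here to keep the example light). Two caveats: setkey physically reorders large.tbl by cd, and the join matches keyz >= cd rather than the strictly-greater rule, which coincides here because ties between a continuous cd and keyz occur with probability zero:

```r
library(data.table)
set.seed(1)
key.table <- data.table(out = (0:10)/10, keyz = sort(runif(11)))
large.tbl <- data.table(ab = rnorm(1e4), cd = runif(1e4))

setkey(large.tbl, cd)    # note: sorts large.tbl by cd in place
setkey(key.table, keyz)

# roll = -Inf: next observation carried backwards, i.e. each cd picks up
# the out value of the next keyz at or above it (NA past the last keyz)
res <- key.table[large.tbl, roll = -Inf]
head(res)  # columns: out, keyz (holding the cd values), ab
```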