I would like to conditionally join two data tables together:
library(data.table)
set.seed(1)
key.table <- data.table(
  out = (0:10) / 10,
  keyz = sort(runif(11))
)
large.tbl <- data.table(
  ab = rnorm(1e6),
  cd = runif(1e6)
)
according to the following rule: for each row of large.tbl, find the smallest value of out in key.table whose keyz value is larger than cd. I have the following:
library(dplyr)
large.tbl %>%
  rowwise() %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
which produces the correct output. The problem is that the rowwise operation is very expensive for the large.tbl I am actually using, crashing the R session on all but one particular computer. Are there less memory-expensive approaches? The following is slightly faster, but not by enough for the problem I have.
large.tbl %>%
  group_by(cd) %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
This smells like a problem with a data.table answer, but the answer does not have to use that package.
If key.table$out is also sorted, as it is in your toy example, the following will work:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1  # index of the first keyz larger than cd
large.tbl$out <- key.table$out[ind]                    # NA when cd exceeds every keyz
head(large.tbl)
# ab cd out
#1: -0.928567035 0.99473795 NA
#2: -0.294720447 0.41107393 0.5
#3: -0.005767173 0.91086585 1.0
#4: 2.404653389 0.66491244 0.8
#5: 0.763593461 0.09590456 0.1
#6: -0.799009249 0.50963409 0.5
(The NA in the first row arises because that cd is larger than every keyz, so there is no match.) If key.table$out is not sorted, take the running minimum from the right instead:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
vec <- rev(cummin(rev(key.table$out)))  # suffix minima: min of out over all keyz at or above each position
large.tbl$out <- vec[ind]
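To spot-check either variant against the rowwise definition, something like the following (a sketch; it assumes large.tbl$out has just been filled in by one of the snippets above, and treats the no-match case as NA, where a rowwise min() would return Inf with a warning):

idx <- sample(nrow(large.tbl), 1000)
ref <- sapply(large.tbl$cd[idx], function(x) {
  m <- key.table$out[key.table$keyz > x]
  if (length(m)) min(m) else NA_real_  # no keyz above cd: no match
})
all.equal(large.tbl$out[idx], ref)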
What you want is:
setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]  # NOCB: each cd gets the row with the next keyz above it
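For example (a sketch; note that in the result the join column keeps x's name, keyz, but holds cd's values, so you may want to rename it):

res <- key.table[large.tbl, roll = -Inf]
setnames(res, "keyz", "cd")  # the join column carries cd's values after the join
head(res)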
See ?data.table:

> roll: Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts), roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.
(To be fair, I think this could use some elucidation; it's pretty dense.)
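To make the roll behaviour concrete, a minimal sketch on a toy table (names here are illustrative, not from the question):

x <- data.table(k = c(2, 4, 6), val = c("a", "b", "c"), key = "k")
i <- data.table(k = c(3, 5, 7))
x[i, roll = TRUE]   # LOCF: 3 -> "a", 5 -> "b", 7 -> "c"
x[i, roll = -Inf]   # NOCB: 3 -> "b", 5 -> "c", 7 -> NA

With roll = -Inf, each i row falling in a gap is matched to the next observation above it, which is exactly the "smallest keyz larger than cd" rule in the question.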