I would like to conditionally join two data tables together:
library(data.table)
set.seed(1)
key.table <- data.table(
  out = (0:10) / 10,
  keyz = sort(runif(11))
)
large.tbl <- data.table(
  ab = rnorm(1e6),
  cd = runif(1e6)
)
according to the following rule: for each row of large.tbl, find the smallest value of out in key.table whose keyz value is larger than cd. I have the following:
library(dplyr)
large.tbl %>%
  rowwise() %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
which produces the correct output. The problem is that the rowwise operation is very expensive for the large.tbl I am actually using, crashing the R session on all but one particular computer. Are there less memory-expensive approaches? The following is slightly faster, but not by enough for the problem I have.
large.tbl %>%
  group_by(cd) %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
This smells like a problem with a data.table answer, but the answer does not have to use that package.
If key.table$out is also sorted, as it is in your toy example, the following will work:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1  # index of the first keyz larger than cd
large.tbl$out <- key.table$out[ind]                    # NA when cd exceeds every keyz
head(large.tbl)
# ab cd out
#1: -0.928567035 0.99473795 NA
#2: -0.294720447 0.41107393 0.5
#3: -0.005767173 0.91086585 1.0
#4: 2.404653389 0.66491244 0.8
#5: 0.763593461 0.09590456 0.1
#6: -0.799009249 0.50963409 0.5
(The NA in the first row arises because that cd is larger than every keyz, so there is no match.) If key.table$out is not sorted, take the running minimum from the right instead:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
vec <- rev(cummin(rev(key.table$out)))  # suffix minima: min of out over all keyz at or above each position
large.tbl$out <- vec[ind]
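To spot-check either variant against the rowwise definition, something like the following (a sketch; it assumes large.tbl$out has just been filled in by one of the snippets above, and treats the no-match case as NA, where a rowwise min() would return Inf with a warning):

idx <- sample(nrow(large.tbl), 1000)
ref <- sapply(large.tbl$cd[idx], function(x) {
  m <- key.table$out[key.table$keyz > x]
  if (length(m)) min(m) else NA_real_  # no keyz above cd: no match
})
all.equal(large.tbl$out[idx], ref)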
What you want is:
setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]  # NOCB: each cd gets the row with the next keyz above it
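For example (a sketch; note that in the result the join column keeps x's name, keyz, but holds cd's values, so you may want to rename it):

res <- key.table[large.tbl, roll = -Inf]
setnames(res, "keyz", "cd")  # the join column carries cd's values after the join
head(res)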
See ?data.table:

> roll: Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts), roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.
(To be fair, I think this could use some elucidation; it's pretty dense.)
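To make the roll behaviour concrete, a minimal sketch on a toy table (names here are illustrative, not from the question):

x <- data.table(k = c(2, 4, 6), val = c("a", "b", "c"), key = "k")
i <- data.table(k = c(3, 5, 7))
x[i, roll = TRUE]   # LOCF: 3 -> "a", 5 -> "b", 7 -> "c"
x[i, roll = -Inf]   # NOCB: 3 -> "b", 5 -> "c", 7 -> NA

With roll = -Inf, each i row falling in a gap is matched to the next observation above it, which is exactly the "smallest keyz larger than cd" rule in the question.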