How to update both data.tables in a join

Tags: merge, r, data.table

Suppose I would like to track which rows from one data.table were merged to another data.table. Is there a way to do this at once/while merging? Please see my example below and the way I usually do it, which seems rather inefficient.

Example

library(data.table)

# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2), 
                y = c(1,3,6))

# data to merge
DTx <- data.table(x = 1:3,
                  y = 1,
                  k = "X")

# regular update join
copy(DT)[DTx,
         on = .(x, y),
         k := i.k][]
#>    x y    k
#> 1: 1 1    X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1    X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1    X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>

# DTx remains the same
DTx
#>    x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X

What I usually do:

# set an Id variable
DTx[, Id := .I]

# assign the Id in merge
DT[DTx,
   on = .(x, y),
   `:=`(k = i.k,
        matched_id = i.Id)][]
#>    x y    k matched_id
#> 1: 1 1    X          1
#> 2: 1 3 <NA>         NA
#> 3: 1 6 <NA>         NA
#> 4: 2 1    X          2
#> 5: 2 3 <NA>         NA
#> 6: 1 6 <NA>         NA
#> 7: 1 1    X          1
#> 8: 2 3 <NA>         NA
#> 9: 2 6 <NA>         NA

# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#>    x y k Id matched
#> 1: 1 1 X  1    TRUE
#> 2: 2 1 X  2    TRUE
#> 3: 3 1 X  3   FALSE

asked Dec 02 '21 16:12

mnist

1 Answer

Following Jan's comment:

"This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables."

You can pull the indices:

merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]

   x y irow xrow
1: 1 1    1    1
2: 1 1    1    7
3: 2 1    2    4
4: 3 1    3    0

Then apply edits to each table using indices rather than merging or matching a second time:

rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]
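
As a self-contained check of the two steps above, the sketch below rebuilds the question's tables, runs the index-based update, and compares the result with the plain update join on a copy. The name `expected` is introduced only for this illustration.

```r
library(data.table)

# tables from the question
DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# reference result: the regular update join, done on a copy
expected <- copy(DT)[DTx, on = .(x, y), k := i.k]

# a single join that records row numbers of both tables
merge_metaDT <- DT[DTx, on = .(x, y), .(irow = .GRP, xrow = .I), by = .EACHI]
rowDT <- merge_metaDT[xrow != 0L]

# update each table by row index, without joining or matching again
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]

identical(DT$k, expected$k)  # should agree with the update join
DTx$matched                  # one flag per row of DTx
```

Here only the first two rows of `DTx` find a partner in `DT`, so only those should end up flagged.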

How it works:

- When joining, x[i], the symbol .I indexes rows of x
- When grouping in a join with by=.EACHI, .GRP indexes each group, which means each row of i here
- We drop the non-matching values of .I which are coded as zeros

On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.
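
To see the two codings side by side, here is a small sketch on the question's data. The filter `!is.na(xrow) & xrow > 0L` is my own defensive assumption, not part of the answer: it accepts either coding of a non-match, in case the zero coding is not something to rely on.

```r
library(data.table)

DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# which = TRUE returns the matching row numbers of DT;
# rows of DTx without a match are reported as NA
xrow_which <- DT[DTx, on = .(x, y), which = TRUE]

# the by = .EACHI route from the answer, where the non-match shows up as 0
meta <- DT[DTx, on = .(x, y), .(irow = .GRP, xrow = .I), by = .EACHI]

# defensive filter (assumption): drop non-matches whether coded as 0 or NA
rowDT <- meta[!is.na(xrow) & xrow > 0L]
```

Note that `which = TRUE` alone gives only the rows of `DT`, not the corresponding rows of `DTx`, which is why the `by = .EACHI` form is used for updating both tables.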

"Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? [...] seems rather inefficient."

I expect this is more efficient than multiple merges or %in% when the merge is costly enough.

It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.

Update logic is already complex in base R, with multiple edits on a single index allowed (the last one wins):

> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3

And there is the question of how to match and edit multiple indices at once (match() returns only the first match, so the second 1 is left untouched):

> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5

In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.
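
A minimal sketch of what mult= changes, again on the question's data. Using `which = TRUE` here is only a way to display which row of `DT` gets picked for each row of `DTx`; it is not part of the answer.

```r
library(data.table)

DT  <- data.table(x = c(1, 1, 1, 2, 2, 1, 1, 2, 2), y = c(1, 3, 6))
DTx <- data.table(x = 1:3, y = 1, k = "X")

# row 1 of DTx matches rows 1 and 7 of DT; mult= decides which one is used
first_match <- DT[DTx, on = .(x, y), mult = "first", which = TRUE]
last_match  <- DT[DTx, on = .(x, y), mult = "last",  which = TRUE]

first_match  # one entry per row of DTx, NA where nothing matched
last_match
```

With `mult = "first"` or `mult = "last"` there is at most one row of `DT` per row of `DTx`, whereas the default `mult = "all"` keeps every match.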

answered Sep 30 '22 17:09

Frank
