I have been playing around with the newer data.table
conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig
and dtSmall
, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max
or min
for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
A create a "big" table with 50 rows (10 values for each ID).
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
I next want to perform a merge by ID
and needs to merge only where ValueSmall
is greater than or equal to ValueBig
. For the matches, I want to grab the max
ranked value in dtBig
. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the output is different at all. It seems like it is just returning the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table
way of grabbing the max value in another data.table
with multiple matches?
I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With