
Conditional data.table merge with .EACHI

Tags: r, data.table

I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.

Set up environment

## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")

Create two fake datasets

Create a "big" table with 50 rows (10 values for each ID).

library(data.table)
set.seed(1L)

# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]

    ID ValueBig Rank
 1:  A      266    3
 2:  A      373    4
 3:  A      573    5
 4:  A      909    9
 5:  A      202    2
---                 
46:  E      790    9
47:  E       24    1
48:  E      478    2
49:  E      733    7
50:  E      693    6

Create a "small" table similar to the first, but with 10 rows (2 values for each ID).

dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))

    ID ValueSmall
 1:  A        478
 2:  A        862
 3:  B        439
 4:  B        245
 5:  C         71
 6:  C        100
 7:  D        317
 8:  D        519
 9:  E        663
10:  E        407

Merge

Next, I want to merge by ID, but only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the maximum Rank in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I do not understand why the two outputs differ at all. Method 1 seems to return just the last matched value.

## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]

## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]

Results

    ID ValueSmall RankSmall RankSmall2 DesiredRank
 1:  A        478         1          4           4
 2:  A        862         1          7           7
 3:  B        439         3          4           4
 4:  B        245         1          2           2
 5:  C         71         1          1           1
 6:  C        100         1          1           1
 7:  D        317         1          2           2
 8:  D        519         3          5           5
 9:  E        663         2          5           5
10:  E        407         1          1           1

Is there a better data.table way of grabbing the max value in another data.table with multiple matches?

Mike.Gahan asked Apr 02 '17 22:04



1 Answer

Next, I want to merge by ID, but only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the maximum Rank in dtBig.

setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]

    ID ValueSmall r
 1:  A        478 4
 2:  A        862 7
 3:  B        439 4
 4:  B        245 2
 5:  C         71 1
 6:  C        100 1
 7:  D        317 2
 8:  D        519 5
 9:  E        663 5
10:  E        407 1

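This works because, once dtBig is sorted by ValueBig within ID, the last matching row for each dtSmall row is the one with the largest ValueBig that does not exceed ValueSmall, and therefore the highest Rank. I imagine it is considerably faster to sort dtBig and take the last matching row than to compute the max by .EACHI, but I am not entirely sure. If you would rather not leave dtBig reordered, save the previous row order first so you can revert to it afterwards.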
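One way to save and restore the row order is a temporary index column. The following is only a sketch on small made-up tables: big, small, and the helper column orig_row are hypothetical names, not objects from the question.

```r
library(data.table)

# Hypothetical toy tables, deliberately left unsorted
big <- data.table(ID = c("A", "A", "A", "B", "B"),
                  ValueBig = c(20, 10, 30, 5, 50),
                  Rank = c(2L, 1L, 3L, 1L, 2L))
small <- data.table(ID = c("A", "B"), ValueSmall = c(25, 60))

big[, orig_row := .I]               # remember the current row order
setorder(big, ID, ValueBig, Rank)   # sort so the last match has the highest Rank

# Same join as above: take the last matching row of big for each row of small
small[, r := big[.SD, on = .(ID, ValueBig <= ValueSmall), mult = "last", x.Rank]]

setorder(big, orig_row)             # revert big to its original row order
big[, orig_row := NULL]             # drop the helper column
```

After this runs, big is back in the order it started in, and small has the new column r.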


Is there a way to aggregate these matches using a function like max or min for these multiple matches?

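For this more general problem, by=.EACHI works, as long as the grouping is done for each row of the target table (dtSmall in this case). by=.EACHI groups by the rows of the table in the i position. In the question's Method 1, dtBig is in the i position, so max(i.Rank) is evaluated on one dtBig row at a time, and each dtSmall row keeps whatever the last matching dtBig row wrote to it. That is why the result changed after sorting. Putting dtSmall in the i position instead gives: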

dtSmall[, r :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
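To make the difference between the two orientations concrete, here is a small sketch on made-up data. The tables big and small and the column names last_write, max_rank, and min_rank are hypothetical, chosen only for illustration.

```r
library(data.table)

# Hypothetical toy tables, deliberately left unsorted
big <- data.table(ID = c("A", "A", "A", "B", "B"),
                  ValueBig = c(20, 10, 30, 5, 50),
                  Rank = c(2L, 1L, 3L, 1L, 2L))
small <- data.table(ID = c("A", "B"), ValueSmall = c(25, 60))

# big in the i position: one group per row of big, so each matching row of
# small is overwritten repeatedly and keeps the value from the last big row
small[big, last_write := max(i.Rank), by = .EACHI,
      on = .(ID, ValueSmall >= ValueBig)]

# small in the i position: one group per row of small, so each aggregate
# sees every matching row of big at once
agg <- big[small, on = .(ID, ValueBig <= ValueSmall),
           .(max_rank = max(x.Rank), min_rank = min(x.Rank)), by = .EACHI]
small[, c("max_rank", "min_rank") := .(agg$max_rank, agg$min_rank)]
```

Here last_write ends up as 1 for ID A, because the big row with ValueBig 10 is the last one that matches, while max_rank is 2 regardless of how big is ordered. The same pattern extends to min or any other aggregate, at least when every row of small has a match; unmatched rows would need separate handling.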
Frank answered Oct 24 '22 03:10
