I have a policyData which is my very huge data set (millions of rows) and I wish to add some info to it with a mapping table (tens of thousands of rows).
Sample:
policyData <- data.table(plan=c("c","b","b","d"),v=c(8,7,5,6),foo=c(4,2,8,3))
mapping <- data.table(plan=c("b","b","a","a","c","c"),a=c(1,2,4,5,7,8),b=c(9,8,6,5,3,2))
policyData:
plan v foo
1: c 8 4
2: b 7 2
3: b 5 8
4: d 6 3
mapping:
plan a b
1: b 1 9
2: b 2 8
3: a 4 6
4: a 5 5
5: c 7 3
6: c 8 2
Problem is that the mapping has multiple instances and I wish to get the first match only. And I need to join the two using memory efficient way using :=
.
The desired output is:
plan v foo a b
1: c 8 4 7 3
2: b 7 2 1 9
3: b 5 8 1 9
4: d 6 3 NA NA
I have tried:
policyData[mapping, on="plan", `:=`(a=i.a, b=i.b)]
which gives the last instance in the mapping table:
plan v foo a b
1: c 8 4 8 2
2: b 7 2 2 8
3: b 5 8 2 8
4: d 6 3 NA NA
I have also tried:
policyData[mapping, on="plan", `:=`(a=i.a, b=i.b), mult="first"]
which gives strange result (the second "b" can't match with the mapping):
plan v foo a b
1: c 8 4 8 2
2: b 7 2 2 8
3: b 5 8 NA NA
4: d 6 3 NA NA
Any insight would be helpful. I've done lots of searches already.
Just summarise mapping
with mapping[, .SD[1], by = plan]
and use that for joining:
policyData[mapping[, .SD[1], by = plan]
, on = .(plan)
, `:=` (a = i.a, b = i.b)]
which gives the desired output:
> policyData plan v foo a b 1: c 8 4 7 3 2: b 7 2 1 9 3: b 5 8 1 9 4: d 6 3 NA NA
Suggesting another alternative:
policyData[, c("a", "b") := mapping[.SD, on="plan", .(a, b), mult="first"]]
sample data to match OP's dimensions:
library(data.table)
set.seed(0L)
nrDS <- 100e6
nrMap <- 90e3
policyData <- data.table(plan=sample(letters,nrDS,TRUE),v=rnorm(nrDS),foo=rnorm(nrDS))
mapping <- data.table(plan=sample(letters,nrMap,TRUE),a=rnorm(nrMap),b=rnorm(nrMap))
memory profiling:
library(bench)
mark(mtd1=policyData[mapping[mapping[, .I[1L], by = plan]$V1], on = .(plan), `:=` (a = i.a, b = i.b)],
mtd2=policyData[, c("a", "b") := mapping[.SD, on="plan", .(a, b), mult="first"]],
mtd3=policyData[unique(mapping, by="plan"), on=.(plan), `:=` (a=i.a, b=i.b)])
memory profiling output:
# A tibble: 3 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 mtd1 7.07s 7.07s 7.07s 7.07s 0.141 4.74GB 0 1 7.07s <data.table [90,000,000 x 5~ <Rprofmem [31,589 x 3]> <bch:t~ <tibble [1 x 3~
2 mtd2 6.73s 6.73s 6.73s 6.73s 0.149 5.03GB 1 1 6.73s <data.table [90,000,000 x 5~ <Rprofmem [20 x 3]> <bch:t~ <tibble [1 x 3~
3 mtd3 7.68s 7.68s 7.68s 7.68s 0.130 3.35GB 1 1 7.68s <data.table [90,000,000 x 5~ <Rprofmem [23 x 3]> <bch:t~ <tibble [1 x 3~
Hugh method is the most memory efficient while mtd2 is fastest. As with most things in life, you need to make a tradeoff.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With