I would like to create a column in my data.table that is identical to another ID-column if there is a previous year entry of that same ID with respect to a third date-column.
My very inefficient solution:
library(data.table)
set.seed(123)
DT = data.table(
  ID = c("b","b","b","a","a","c"),
  dates = sample(seq(as.Date('2016/01/01'), as.Date('2019/01/01'), by = "day"), 12)
)
setorder(DT, ID, dates)
DT[, Desired_Column:=DT[ID == .BY[[1]] & year(dates) < year(.BY[[2]]), ID[.N]], by=.(ID, dates)]
My questions: Why is that slow on big datasets and what would be a way to do it faster?
EDIT: The initial version did not capture the whole problem. I was surprised that the filter year(dates) > min(year(dates)) seemed to work by group, but it actually does not. I changed the dates column so that dates in the year 2016 are possible. Now group a has no entry earlier than 2017, which should make the first entry of Desired_Column NA.
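To see the difference, compare the global filter with a grouped one (a minimal illustrative sketch, not a solution to the question):
# min(year(dates)) without by= is the minimum over the whole table (2016 here),
# so the plain filter would keep group "a"'s 2017 row even though "a" has no earlier entry
DT[year(dates) > min(year(dates)), .(ID, dates)]
# the grouped version takes the minimum per ID instead
DT[, .SD[year(dates) > min(year(dates))], by = ID]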
Here is the output I would like to get:
ID dates Desired_Column
1: a 2017-05-11 <NA>
2: a 2018-08-24 a
3: a 2018-10-24 a
4: a 2018-11-06 a
5: b 2016-11-11 <NA>
6: b 2017-03-23 b
7: b 2017-07-30 b
8: b 2017-08-23 b
9: b 2018-05-13 b
10: b 2018-08-30 b
11: c 2016-02-19 <NA>
12: c 2017-05-07 c
my approach
DT[ DT[, .I[ year(dates) > min(year(dates))], by = "ID"]$V1, Desired_Column := ID][]
# ID dates Desired_Column
# 1: a 2017-05-11 <NA>
# 2: a 2018-08-24 a
# 3: a 2018-10-24 a
# 4: a 2018-11-06 a
# 5: b 2016-11-11 <NA>
# 6: b 2017-03-23 b
# 7: b 2017-07-30 b
# 8: b 2017-08-23 b
# 9: b 2018-05-13 b
# 10: b 2018-08-30 b
# 11: c 2016-02-19 <NA>
# 12: c 2017-05-07 c
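For clarity: the inner call collects, per ID, the row numbers .I of all entries whose year exceeds the group's earliest year; $V1 extracts those indices, and Desired_Column is then assigned only on that row subset. A small sketch of the intermediate result (illustrative only, idx is just a throwaway name):
idx <- DT[, .I[year(dates) > min(year(dates))], by = "ID"]
idx      # two columns: ID and V1, the global row numbers to be updated
idx$V1   # for the example data above: 2 3 4 6 7 8 9 10 12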
benchmarking
microbenchmark::microbenchmark(
  my_solution = DT[ DT[, .I[ year( dates ) > min( year( dates ) ) ], by = "ID"]$V1, Desired_Column := ID][],
  your_solution = DT[, Desired_Column:=DT[ID == .BY[[1]] & year(dates) < year(.BY[[2]]), ID[.N]], by=.(ID, dates)][],
  akrun = {
    DT[, yr := year(dates)]
    DT[DT[, .(yr = first(yr)), ID], Desired_Column := ID, on = .(ID, yr > yr)]
  }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_solution 1.349660 1.470769 1.670500 1.612211 1.836653 2.764091 100
# your_solution 4.317707 4.510213 4.877906 4.656327 4.893572 21.164655 100
# akrun 3.637755 3.812187 4.320189 4.197804 4.675306 10.018512 100
and on a dataset of length 1,000
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_solution 1.635860 1.787998 2.323437 2.038197 2.504854 10.82108 100
# your_solution 342.582218 352.706475 367.424500 359.987257 375.076633 467.85023 100
# akrun 3.749825 4.291949 5.448715 4.949456 5.368815 39.72218 100
and on dataset of length 1,000,000
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_solution 270.8044 280.4150 324.1195 284.5502 390.1511 393.2282 10
# your_solution - I did not dare to run ;-)
# akrun 166.2049 167.8109 209.5945 178.2247 291.4220 297.0243 10
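The generation of the larger test data is not shown above; a dataset with a comparable structure can be built roughly like this (a sketch, the ID pool and n are assumptions):
n  <- 1e6   # or 1e3 for the smaller run
DT <- data.table(
  ID    = sample(letters, n, replace = TRUE),
  dates = sample(seq(as.Date('2016/01/01'), as.Date('2019/01/01'), by = "day"), n, replace = TRUE)
)
setorder(DT, ID, dates)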
conclusion
My subsetting answer is the most efficient for data.tables of up to about 50,000 rows; above that size, the non-equi join solution by @akrun is the performance winner.
Here is an option with a non-equi join. As the 'dates' column is already ordered, we can subset the first 'year' grouped by 'ID' and use that in a non-equi self join to create 'Desired_Column', thereby avoiding the step of getting the minimum value.
DT[, yr := year(dates)]
DT[DT[, .(yr = first(yr)), ID], Desired_Column := ID, on = .(ID, yr > yr)]
DT
# ID dates yr Desired_Column
# 1: a 2017-11-26 2017 <NA>
# 2: a 2018-10-05 2018 a
# 3: a 2018-11-15 2018 a
# 4: a 2018-11-21 2018 a
# 5: b 2017-07-30 2017 <NA>
# 6: b 2017-10-26 2017 <NA>
# 7: b 2018-01-18 2018 b
# 8: b 2018-02-03 2018 b
# 9: b 2018-07-30 2018 b
#10: b 2018-10-09 2018 b
#11: c 2017-02-03 2017 <NA>
#12: c 2017-11-23 2017 <NA>
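The join is equivalent to comparing each row's year against its group's minimum; written out explicitly (a sketch for comparison, not part of the original answer), with the helper column dropped afterwards:
# equivalent grouped formulation; fifelse() needs data.table >= 1.12.4
DT[, Desired_Column := fifelse(yr > min(yr), ID, NA_character_), by = ID]
DT[, yr := NULL]   # remove the helper column once it is no longer needed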