Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

By group check if previous entry exists based on other column

Tags:

r

data.table

I would like to create a column in my data.table that is identical to another ID-column if there is a previous year entry of that same ID with respect to a third date-column.

My very inefficient solution:

library(data.table)
set.seed(123)
DT = data.table(
  ID = c("b","b","b","a","a","c"),
  dates = sample(seq(as.Date('2016/01/01'), as.Date('2019/01/01'), by="day"), 12)
)

setorder(DT, ID, dates)
DT[, Desired_Column:=DT[ID == .BY[[1]] & year(dates) < year(.BY[[2]]), ID[.N]], by=.(ID, dates)]

My questions: Why is that slow on big datasets and what would be a way to do it faster?

EDIT: the initial version did not capture the whole problem. I was suprised, that the filter year( dates ) > min( year( dates ) ) works by group, but it actually does not. I changed the dates-column, so that dates of year 2016 are possible. Now group a has no entry early than 2017, which should make the first entry of Desired_Column NA.

Here is the output I would like to get:

      ID      dates Desired_Column
 1:  a 2017-05-11           <NA>
 2:  a 2018-08-24              a
 3:  a 2018-10-24              a
 4:  a 2018-11-06              a
 5:  b 2016-11-11           <NA>
 6:  b 2017-03-23              b
 7:  b 2017-07-30              b
 8:  b 2017-08-23              b
 9:  b 2018-05-13              b
10:  b 2018-08-30              b
11:  c 2016-02-19           <NA>
12:  c 2017-05-07              c
like image 396
HannesZ Avatar asked Jan 26 '23 19:01

HannesZ


2 Answers

my approach

DT[ DT[, .I[ year(dates) > min(year(dates))], by = "ID"]$V1, Desired_Column := ID][]

#     ID      dates Desired_Column
#  1:  a 2017-05-11           <NA>
#  2:  a 2018-08-24              a
#  3:  a 2018-10-24              a
#  4:  a 2018-11-06              a
#  5:  b 2016-11-11           <NA>
#  6:  b 2017-03-23              b
#  7:  b 2017-07-30              b
#  8:  b 2017-08-23              b
#  9:  b 2018-05-13              b
# 10:  b 2018-08-30              b
# 11:  c 2016-02-19           <NA>
# 12:  c 2017-05-07              c

benchmarking

microbenchmark::microbenchmark( 
  my_solution = DT[ DT[, .I[ year( dates ) > min( year( dates ) ) ], by = "ID"]$V1, Desired_Column := ID][],
  your_solution = DT[, Desired_Column:=DT[ID == .BY[[1]] & year(dates) < year(.BY[[2]]), ID[.N]], by=.(ID, dates)][],
  akrun = {
    DT[, yr := year(dates)]
    DT[DT[, .(yr = first(yr)), ID],  Desired_Column := ID, on = .(ID, yr > yr)]
  }
)

# Unit: milliseconds
#          expr      min       lq     mean   median       uq       max neval
#   my_solution 1.349660 1.470769 1.670500 1.612211 1.836653  2.764091   100
# your_solution 4.317707 4.510213 4.877906 4.656327 4.893572 21.164655   100
#         akrun 3.637755 3.812187 4.320189 4.197804 4.675306 10.018512   100

and on a dataset of length 1,000

# Unit: milliseconds
#          expr        min         lq       mean     median         uq       max neval
#   my_solution   1.635860   1.787998   2.323437   2.038197   2.504854  10.82108   100
# your_solution 342.582218 352.706475 367.424500 359.987257 375.076633 467.85023   100
#         akrun   3.749825   4.291949   5.448715   4.949456   5.368815  39.72218   100

and on dataset of length 1,000,000

# Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
#   my_solution 270.8044 280.4150 324.1195 284.5502 390.1511 393.2282    10
# your_solution   - I did not dare to run ;-)
#         akrun 166.2049 167.8109 209.5945 178.2247 291.4220 297.0243    10

conclusion

My subsetting-answer works most efficient of data.tables up to about 50,000 rows, above that size the non-equi join-solution by @akrun is the performance winner.

like image 181
Wimpel Avatar answered Jan 30 '23 07:01

Wimpel


Here is an option with non-equi join. As the 'dates' column is already ordered, can subset the first 'year' grouped by 'ID' and use that in the non-equi self join for creating the 'Desired_Column', thereby avoiding the step to get the minimum value

DT[, yr := year(dates)]
DT[DT[, .(yr = first(yr)), ID],  Desired_Column := ID, on = .(ID, yr > yr)]
DT
#    ID      dates   yr Desired_Column
# 1:  a 2017-11-26 2017           <NA>
# 2:  a 2018-10-05 2018              a
# 3:  a 2018-11-15 2018              a
# 4:  a 2018-11-21 2018              a
# 5:  b 2017-07-30 2017           <NA>
# 6:  b 2017-10-26 2017           <NA>
# 7:  b 2018-01-18 2018              b
# 8:  b 2018-02-03 2018              b
# 9:  b 2018-07-30 2018              b
#10:  b 2018-10-09 2018              b
#11:  c 2017-02-03 2017           <NA>
#12:  c 2017-11-23 2017           <NA>
like image 35
akrun Avatar answered Jan 30 '23 09:01

akrun