Set column value using the first next row in the same group that meets a condition

Tags:

r

data.table

I am new to R and this is my first question on Stack Overflow.

I am trying

  • to assign by reference to a new column
  • for each row
  • using the value from the first next row within the same group of rows
  • that meets a condition.

Example data:

    id code  date_down    date_up
 1:  1    p 2019-01-01 2019-01-02
 2:  1    f 2019-01-02 2019-01-03
 3:  2    f 2019-01-02 2019-01-02
 4:  2    p 2019-01-03       <NA>
 5:  3    p 2019-01-04       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05
 7:  5    f 2019-01-07 2019-01-08
 8:  5    p 2019-01-07 2019-01-08
 9:  5    p 2019-01-09 2019-01-09
10:  6    f 2019-01-10 2019-01-10
11:  6    p 2019-01-10 2019-01-10
12:  6    p 2019-01-10 2019-01-11

What I would like to do is

  • subset (group) by id
  • and for each row
  • find date_up for the first row further down,
  • where code == 'p' and date_up (of the row found) is greater than date_down of the row I am updating.

My expected result is:

    id code  date_down    date_up  founddate
 1:  1    p 2019-01-01 2019-01-02       <NA>
 2:  1    f 2019-01-02 2019-01-03       <NA>
 3:  2    f 2019-01-02 2019-01-02       <NA>
 4:  2    p 2019-01-03       <NA>       <NA>
 5:  3    p 2019-01-04       <NA>       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05       <NA>
 7:  5    f 2019-01-07 2019-01-08 2019-01-08
 8:  5    p 2019-01-07 2019-01-08 2019-01-09
 9:  5    p 2019-01-09 2019-01-09       <NA>
10:  6    f 2019-01-10 2019-01-10 2019-01-11
11:  6    p 2019-01-10 2019-01-10 2019-01-11
12:  6    p 2019-01-10 2019-01-11       <NA>

I have tried many variants using .SD and .N, and creating a new column with dt[, idcount := seq_len(.N), by = id], but have not really got anywhere. Any help greatly appreciated.

Also, any good references to data.table would be welcome :) Many thanks

Edit: I have edited the original example data to give a more subtle case, whereby row 10 is updated with data from row 12, because row 12 is in the same id subset and meets the qualifying criteria. Row 11 does not meet the qualifying criteria, so its data is not used to update row 10. Also included my first use of dput!

Example data as dput code:

dt <- structure(list(
id        = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
code      = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
class     = c("data.table", "data.frame"),
row.names = c(NA, -12L))
setDT(dt)  # to reinit the internal self ref pointer (known issue)
asked Mar 08 '19 by OllieB

2 Answers

Join a data.table to a subset of itself, by group, to get values from rows matching non-equal criteria.

Summary:

  • Below I show 5 working data.table solutions that were candidates for performance testing against the OP's actual data set (1.4M records).

  • All 5 solutions use "non-equi" joins (using inequality to compare columns for the join) in the on clause.

  • Each solution is just a small progressive code change so it should be easy to follow along to compare different data.table options and syntax choices.

Approach

To work through the data.table syntax for this, I broke the OP's problem into the following steps:

  1. Join the dt to a subset of itself (or another data.table for that matter).
  2. Select (and rename) the columns you want from either dt or the subset.
  3. Define the join criteria based on columns from dt compared to columns in the subset, including using "non-equi" (non-equal) comparisons.
  4. Optionally define whether first or last match should be selected when multiple matching records are found in the subset.
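Before Solution 1, here is a minimal toy sketch of the general non-equi join pattern, using hypothetical tables X and Y (not the OP's data), just to show the mechanics:

X <- data.table(g = c(1, 1), t = c(5, 9), val = c("a", "b"))
Y <- data.table(g = c(1, 1), t = c(4, 8))

# For each row of Y, find rows of X in the same group g with a larger t.
# The x. prefix selects from X; mult = "first" keeps only the first match.
X[Y, .(g = x.g, y_t = i.t, first_val = x.val), on = .(g == g, t > t), mult = "first"]
#    g y_t first_val
# 1: 1   4         a
# 2: 1   8         b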

Solution 1:

# Add row numbers to all records in dt (only because you 
# have criteria based on comparing sequential rows):
dt[, row := .I] 

# Compute the result column (then standard assignment into dt using <-)
dt$found_date  <- 
            dt[code=='p'][dt,   # join dt to the data.table matching your criteria, in this case dt[code=='p']
                          .( x.date_up ),   # columns to select, x. prefix means columns from dt[code=='p'] 
                          on = .(id==id, row > row, date_up > date_down),   # join criteria: dt[code=='p'] fields on LHS, main dt fields on RHS
                          mult = "first"]   # get only the first match if multiple matches

Note in the join expressions above:

  • i in this case is your main dt. This way you get all records from your main data.table.
  • x is the subset (or any other data.table) from which you want to find matching values.
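To make the i / x roles concrete, here is a tiny hypothetical example (again, not the OP's data):

X <- data.table(k = c("a", "b"), vx = 1:2)
i <- data.table(k = c("b", "c"), vi = 10:11)
# X[i] keeps every row of i (a right join); the x. prefix pulls values
# from X, giving NA where i has no match in X.
X[i, .(k = i.k, vi, vx = x.vx), on = .(k)]
#    k vi vx
# 1: b 10  2
# 2: c 11 NA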

Result matches requested output:

dt

    id code  date_down    date_up row found_date
 1:  1    p 2019-01-01 2019-01-02   1       <NA>
 2:  1    f 2019-01-02 2019-01-03   2       <NA>
 3:  2    f 2019-01-02 2019-01-02   3       <NA>
 4:  2    p 2019-01-03       <NA>   4       <NA>
 5:  3    p 2019-01-04       <NA>   5       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05   6       <NA>
 7:  5    f 2019-01-07 2019-01-08   7 2019-01-08
 8:  5    p 2019-01-07 2019-01-08   8 2019-01-09
 9:  5    p 2019-01-09 2019-01-09   9       <NA>
10:  6    f 2019-01-10 2019-01-10  10 2019-01-11
11:  6    p 2019-01-10 2019-01-10  11 2019-01-11
12:  6    p 2019-01-10 2019-01-11  12       <NA>

Note: You may remove the row column by doing dt[, row := NULL] if you like.

Solution 2:

Identical join logic to Solution 1, but now using "assign by reference" (:=) to create found_date in dt:

dt[, row := .I] # add row numbers (as in all the solutions)

# Compute the result column (then assign by reference into dt using :=)

# dt$found_date  <- 
dt[, found_date :=   # assign by reference to dt$found_date 
            dt[code=='p'][dt, 
                          .( x.date_up ), 
                          on = .(id==id, row > row, date_up > date_down),
                          mult = "first"]]

In Solution 2, the slight variation of assigning the result "by reference" into dt should be more efficient than Solution 1. Solution 1 calculated the result in exactly the same way; the only difference is that it used standard assignment <- to create dt$found_date, which is less efficient because it copies the table.
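As a quick, hypothetical way to see that difference, data.table's address() shows that := updates the table in place while $<- reallocates it (demo is a throwaway table, not the OP's dt):

demo <- data.table(a = 1:3)
address(demo)          # note the memory address
demo[, b := a * 2]     # assign by reference
address(demo)          # unchanged: modified in place
demo$c <- demo$a * 3   # standard assignment
address(demo)          # different: the table was copied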

Solution 3:

Like Solution 2 but now using .(.SD) in place of dt to refer to the original dt without naming it directly.

dt[, row := .I] # add row numbers (as in all the solutions)
setkey(dt, id, row, date_down)  #set key for dt 

# For all rows of dt, create found_date by reference :=
dt[, found_date := 
            # dt[code=='p'][dt, 
            dt[code=='p'][.(.SD),   # our subset (or another data.table), joined to .SD (referring to original dt)
                          .( x.date_up ), 
                          on = .(id==id, row > row, date_up > date_down),  
                          mult = "first"] ]  

.SD above refers back to the original dt that we are assigning into: it is the subset of the data.table containing the rows selected by the outer dt[ call, which here is every row because we applied no filter.
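A minimal hypothetical illustration of .SD on its own (d is a toy table):

d <- data.table(id = c(1, 1, 2), v = 1:3)
d[, print(.SD)]                      # no by: .SD is the whole table
d[, .(first_v = .SD$v[1]), by = id]  # with by: .SD holds each id's rows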

Note: In Solution 3 I used setkey() to set the key. I should have done that in Solutions 1 and 2 as well; however, I didn't want to change those solutions after @OllieB had tested them successfully.

Solution 4:

Like Solution 3, but using .SD in one more place than before. Our main data.table name dt now appears only once across the entire expression!

# add row column and setkey() as previous solutions

dt[, found_date :=
            # dt[code=='p'][.(.SD), 
            .SD[code=='p'][.SD,   # .SD in place of dt at left!  Also, removed .() at right (not sure on this second change)
                           .(found_date = x.date_up),
                           on = .(id==id, row > row, date_up > date_down),
                           mult = "first"]]

With the change above our data.table name dt appears only once. I like that a lot because it makes it easy to copy, adapt and reuse elsewhere.

Also note: where I'd previously used .(.SD), I've now removed the .() wrapper around .SD because it doesn't appear to be required. However, I'm not sure whether that change has any performance impact, or which form is the preferred data.table syntax. I would be grateful if anyone can add a comment to advise on that point.

Solution 5:

Like the previous solutions, but making use of by to explicitly group the subsets that the join operates on:

# add row column and setkey() as previous solutions

dt[, found_date :=
       .SD[code=='p'][.SD,
                      .(found_date = x.date_up),
                      # on = .(id==id, row > row, date_up > date_down),
                      on = .(row > row, date_up > date_down),  # removed the id column from here
                      mult = "first"]
   , by = id]   # added by = id to group the .SD subsets 

In this last solution I changed the code to use the by clause to explicitly group the .SD subsets on id.

Note: Solution 5 did not perform well against OllieB's actual data compared to Solutions 1 - 4. However, testing on my own mock data I found that Solution 5 could perform well when the number of unique groups in the id column was low:

  • With only 6 groups in 1.5M records, this solution worked just as fast as the others.
  • With 40k groups in 1.5M records, I saw the same poor performance that OllieB reported.
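If you want to reproduce that group-count effect, a rough sketch along these lines should work (make_mock, dt6 and dt40k are hypothetical mock data, not OllieB's data set; timings are illustrative only):

set.seed(1)
n <- 1.5e6
make_mock <- function(n_groups) {
  m <- data.table(id        = sample(n_groups, n, replace = TRUE),
                  code      = sample(c("p", "f"), n, replace = TRUE),
                  date_down = as.Date("2019-01-01") + sample(0:30, n, replace = TRUE))
  m[, date_up := date_down + sample(0:3, n, replace = TRUE)]
  m[, row := .I]
  m
}
dt6   <- make_mock(6)      # few groups: Solution 5 stays competitive
dt40k <- make_mock(40000)  # many groups: Solution 5 degrades
system.time(
  dt40k[, found_date := .SD[code == "p"][.SD, x.date_up,
                            on = .(row > row, date_up > date_down),
                            mult = "first"],
        by = id]
)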

Results

Solutions 1 - 4 performed well:

  • For 1.45M records in OllieB's actual data, each of Solutions 1 to 4 ran in 2.42 seconds or less "elapsed" time according to OllieB's feedback. Solution 3 appears to have been fastest for OllieB at "elapsed=1.22" seconds.

  • I personally prefer Solution 4 because of the simpler syntax.

Solution 5

  • Solution 5 (using by clause) performed poorly taking 577 seconds for OllieB's testing on his real data.

Versions used

data.table version: 1.12.0

R version 3.5.3 (2019-03-11)


Possible further improvements:

  • Changing the date fields to integer may help the join run more efficiently. See as.IDate() for converting dates to integer-backed dates in data.tables (a small sketch follows this list).
  • The setkey() step may no longer be needed: as explained here by @Arun, on= invokes (often more efficient) secondary indices and auto indexing.
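A hedged sketch of that as.IDate() conversion, assuming the dt from the question:

dt[, `:=`(date_down = as.IDate(date_down),
          date_up   = as.IDate(date_up))]
storage.mode(dt$date_down)   # "integer": IDate is integer-backed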

References to data.table

As part of your question you've asked for "any good references to data.table". I've found the following helpful:

  • data.table Getting started Wiki on GitHub is the place to start.

  • In particular for this problem it's worth reading:

    • What does .SD stand for in data.table in R
    • The HTML vignette for Secondary indices and auto indexing

Importantly, note this answer by @Arun, which explains "the reason for implementing on= argument" and suggests it may no longer be necessary to set keys:

It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

In most cases therefore, there shouldn't be a need to set keys any more. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.

  • This SO question seems to be the hub of information about the different data.table joins: How to join (merge) data frames (inner, outer, left, right)?

  • Finally, data.table cheat sheet is an excellent reference (from link found on the data.table Getting started Wiki on GitHub).
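To illustrate the quoted advice about on=, here is a toy ad-hoc join on hypothetical tables a and b, with no setkey() call anywhere:

a <- data.table(id = 1:3, v = c(10, 20, 30))
b <- data.table(id = 2:3, w = c("x", "y"))
a[b, on = .(id)]   # joins on the fly, no key needed
#    id  v w
# 1:  2 20 x
# 2:  3 30 y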


As always I'm grateful if anyone has suggestions as perhaps this can be improved further.

Please feel free to comment, correct or post other solutions if you can add anything.

answered Oct 06 '22 by krads

A Not-the-Data-Table-Way approach:

df <- structure(list(
  id        = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
  code      = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
  date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
  date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
  class     = c("data.frame"),
  row.names = c(NA, -12L))

# For each id group, for each row, scan the rows below it for the first one
# with code == "p" and a date_up later than this row's date_down.
Lista <- lapply(split(df, df$id), function(x){
  x$founddate <-
    sapply(seq_len(nrow(x)), function(y){
      na.omit(sapply(y:nrow(x), function(i){
        ifelse(x[i + 1, "code"] == "p" & x[i + 1, "date_up"] > x[y, "date_down"],
               x[i + 1, "date_up"], NA)
      }))[1]
    })
  x$founddate <- as.Date(x$founddate, origin = "1970-01-01")
  return(x)
})

df <- do.call(rbind.data.frame, Lista)

df
     id code  date_down    date_up  founddate
1.1   1    p 2019-01-01 2019-01-02       <NA>
1.2   1    f 2019-01-02 2019-01-03       <NA>
2.3   2    f 2019-01-02 2019-01-02       <NA>
2.4   2    p 2019-01-03       <NA>       <NA>
3     3    p 2019-01-04       <NA>       <NA>
4     4 <NA> 2019-01-05 2019-01-05       <NA>
5.7   5    f 2019-01-07 2019-01-08 2019-01-08
5.8   5    p 2019-01-07 2019-01-08 2019-01-09
5.9   5    p 2019-01-09 2019-01-09       <NA>
6.10  6    f 2019-01-10 2019-01-10 2019-01-11
6.11  6    p 2019-01-10 2019-01-10 2019-01-11
6.12  6    p 2019-01-10 2019-01-11       <NA>

Under the given conditions, there can be more than one match per row. The proposed answer takes the first match, but this can be modified.
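For example, one possible tweak inside the per-group function above (a sketch, untested at scale): drop the [1] and switch to lapply, so founddate becomes a list column holding every qualifying date rather than just the first:

x$founddate <- lapply(seq_len(nrow(x)), function(y){
  hits <- na.omit(sapply(y:nrow(x), function(i){
    ifelse(x[i + 1, "code"] == "p" & x[i + 1, "date_up"] > x[y, "date_down"],
           x[i + 1, "date_up"], NA)
  }))
  as.Date(hits, origin = "1970-01-01")  # every match, not only the first
})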

Hope it helps.

answered Oct 06 '22 by Santiago Capobianco