
Multiple conditions for r data.table calculation

Tags: r, data.table

I have a data.table like the following:

   Sim j active cost
1:   1 1      1  100
2:   1 2      1  125
3:   1 3      0  200
4:   1 4      1  250
5:   2 1      1  100
6:   2 2      0   50
7:   2 3      0  125
8:   2 4      1  200

library(data.table)

dt <- data.table(Sim = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 3, 4, 1, 2, 3, 4),
                 active = c(1, 1, 0, 1, 1, 0, 0, 1),
                 cost = c(100, 125, 200, 250, 100, 50, 125, 200))

I want to add a column 'incr_cost' that, for each row i, subtracts the cost in a different row, which I'll call row k, from the cost in row i (that is, cost_i - cost_k), where row k meets these conditions:

  • sim_k = sim_i
  • active_k = 1
  • j_k < j_i
  • row k contains the largest j of all rows meeting the 3 above conditions

For rows where j=1, incr_cost can just be NA.

In my example, the solution would look like:

   Sim j active cost incr_cost
1:   1 1      1  100        NA
2:   1 2      1  125        25
3:   1 3      0  200        75
4:   1 4      1  250       125
5:   2 1      1  100        NA
6:   2 2      0   50       -50
7:   2 3      0  125        25
8:   2 4      1  200       100
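
As a cross-check on the four conditions, they can be spelled out row by row. The following is only a minimal, unoptimized sketch: the helper name incr_cost_for_row is hypothetical (it is not from the post), and it is meant as a reference to compare faster solutions against, not as a recommended approach.

```r
library(data.table)

dt <- data.table(Sim = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 3, 4, 1, 2, 3, 4),
                 active = c(1, 1, 0, 1, 1, 0, 0, 1),
                 cost = c(100, 125, 200, 250, 100, 50, 125, 200))

# Hypothetical helper: for row i, find row k and return cost_i - cost_k,
# or NA when no row satisfies the conditions.
incr_cost_for_row <- function(i, d) {
  # Candidates for row k: same Sim, active, and strictly smaller j
  k <- which(d$Sim == d$Sim[i] & d$active == 1 & d$j < d$j[i])
  if (length(k) == 0L) return(NA_real_)
  # Of the candidates, keep the one with the largest j
  k <- k[which.max(d$j[k])]
  d$cost[i] - d$cost[k]
}

reference <- vapply(seq_len(nrow(dt)), incr_cost_for_row, numeric(1), d = dt)
print(reference)
```

On the example data this should reproduce the incr_cost column shown above.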

This seems similar to applications of shift, except that instead of shifting on the data.table as is, I want to shift on a row-reduced data.table in which rows not meeting my conditions are filtered out. I'm having a hard time understanding how to identify the row that has the largest j value that is less than that of my current row (and that meets the other two conditions).

The following works except that it does not consider whether a row is active when selecting row k:

dt[, incr_cost := cost - shift(cost, fill=NA), by=Sim]
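
To make the gap concrete, here is a small sketch that runs this plain shift on the example data and compares it with the desired column. The names naive and expected are illustrative only.

```r
library(data.table)

dt <- data.table(Sim = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 3, 4, 1, 2, 3, 4),
                 active = c(1, 1, 0, 1, 1, 0, 0, 1),
                 cost = c(100, 125, 200, 250, 100, 50, 125, 200))

# Plain shift: always subtracts the previous row within Sim,
# whether or not that row is active. Work on a copy so dt is untouched.
naive <- copy(dt)[, incr_cost := cost - shift(cost, fill = NA), by = Sim]

# Desired result, copied from the table in the question
expected <- c(NA, 25, 75, 125, NA, -50, 25, 100)

# Rows where the plain shift disagrees with the desired result
# (NA comparisons are dropped by which())
wrong_rows <- which(naive$incr_cost != expected)
print(naive$incr_cost)
print(wrong_rows)
```

The mismatches are exactly the rows whose immediately preceding row is inactive.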

I am using r data.table, but non-data.table solutions are also welcome. Thank you!

Alton asked Apr 21 '18 01:04



1 Answer

You can use a rolling join:

dt[, v := 
  cost - .SD[.(active = 1, Sim = Sim, j = j - 1), on=.(active, Sim, j), roll=TRUE, x.cost]]

   Sim j active cost   v
1:   1 1      1  100  NA
2:   1 2      1  125  25
3:   1 3      0  200  75
4:   1 4      1  250 125
5:   2 1      1  100  NA
6:   2 2      0   50 -50
7:   2 3      0  125  25
8:   2 4      1  200 100

This looks up the tuples .(active = 1, Sim = Sim, j = j - 1) in the table and, when an exact match is not found, "rolls" back to the largest earlier j value with the same active and Sim, if any. When there is no such row, the lookup returns NA.
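
The rolling behavior can be seen in isolation with single lookups. This is only an illustrative sketch: it filters to the active rows first instead of joining on active, and the names active_rows, rolled, exact_only and nothing_before are made up for the example.

```r
library(data.table)

dt <- data.table(Sim = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 3, 4, 1, 2, 3, 4),
                 active = c(1, 1, 0, 1, 1, 0, 0, 1),
                 cost = c(100, 125, 200, 250, 100, 50, 125, 200))

# Only active rows can serve as the lookup target
active_rows <- dt[active == 1]

# Sim 1 has no active row with j = 3, so roll = TRUE falls back to the
# last active row before it (j = 2) and returns that row's cost.
rolled <- active_rows[.(Sim = 1, j = 3), on = .(Sim, j), roll = TRUE, x.cost]

# Without roll, the same lookup finds no exact match and yields NA.
exact_only <- active_rows[.(Sim = 1, j = 3), on = .(Sim, j), x.cost]

# Nothing precedes j = 0, so there is nothing to roll from: NA again.
nothing_before <- active_rows[.(Sim = 1, j = 0), on = .(Sim, j), roll = TRUE, x.cost]

print(c(rolled, exact_only, nothing_before))
```

The roll applies to the last column listed in on, which is why j comes last there.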

How it works

In j of x[i, j], .SD is just a shorthand for the table itself, the "Subset of Data".

In j of a join x[i, on=, roll=, j]...

  • the prefix x.* refers to columns of x (here, .SD); and similarly
  • i.* would be a prefix for columns of i (here, the tuples).

(OP's use of j as a name might make this confusing. I mean j, the argument in DT[i, j, ...].)
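
A toy join may make the two prefixes easier to tell apart. The tables x and lookup below are invented for illustration and have nothing to do with the OP's data; both have a column named val, so the prefix decides which one is meant.

```r
library(data.table)

# x plays the role of the table being looked up in;
# lookup plays the role of i, the rows to look up.
x <- data.table(id = c(1, 2), val = c(10, 20))
lookup <- data.table(id = c(2, 3), val = c(200, 300))

# One output row per row of lookup.
# x.val comes from x (NA where id has no match); i.val comes from lookup.
res <- x[lookup, on = .(id), .(id = i.id, x_val = x.val, i_val = i.val)]
print(res)
```

Here id = 3 has no match in x, so its x_val is NA while its i_val is still available.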

Frank answered Nov 04 '22 22:11
