
How to perform multiple row-wise operations that depend on previous rows using [r] data.table (if possible)


I have the following data table:

library(data.table)

dt <- fread("
  ID   | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5  |  1.2 |      |   A
ID_002 |      |      |      |   A
ID_003 |      |      |      |   A
ID_004 |      |      |      |   A
ID_001 | 0.4  |  2.5 |      |   B
ID_002 |      |      |      |   B
ID_003 |      |      |      |   B
ID_004 |      |      |      |   B
",
            sep = "|",
            colClasses = c("character", "numeric", "numeric", "numeric", "character"))

and I'm trying to perform some row-wise operations, which sometimes depend on data from previous rows. More specifically:

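# EO_1 depends on the previous row's EO_1 and EO_2
calc_EO_1 <- function(EO_1, EO_2) {
  shift(EO_1, type = "lag") * shift(EO_2, type = "lag")
}

# EO_2 depends on the current EO_1 and the previous row's EO_2 and EO_3
calc_EO_2 <- function(EO_1, EO_2, EO_3) {
  EO_1 * shift(EO_2, type = "lag") * shift(EO_3, type = "lag")
}

# EO_3 depends only on values in the current row
calc_EO_3 <- function(EO_1, EO_2) {
  EO_1 * EO_2
}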

The last one has to be calculated first, for the first row of each group, since it depends only on the other fields in the same row (that should be easy). After that, all three operations have to be applied consecutively, one row at a time.

The closest I've come is the following:

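first_row_bygroup_index <- dt[, .I[1], by = GROUP]$V1

# First row of each group: EO_3 depends only on values in the same row
dt[first_row_bygroup_index,
   EO_3 := calc_EO_3(EO_1, EO_2)]

# Remaining rows: try to apply the three rules one row at a time
dt[!first_row_bygroup_index,
   `:=` (
     EO_1 = calc_EO_1(EO_1, EO_2),
     EO_2 = calc_EO_2(EO_1, EO_2, EO_3),
     EO_3 = calc_EO_3(EO_1, EO_2)
   ),
   by = row.names(dt[!first_row_bygroup_index])]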

but it only calculates the first row properly:

  ID   | EO_1 | EO_2 | EO_3 | GROUP
ID_001 | 0.5  |  1.2 |  0.6 |   A  
ID_002 |      |      |      |   A
ID_003 |      |      |      |   A
ID_004 |      |      |      |   A
ID_001 | 0.4  |  2.5 |  1.0 |   B
ID_002 |      |      |      |   B
ID_003 |      |      |      |   B
ID_004 |      |      |      |   B  

The blank cells are NAs.

I don't think I'm too far away from the solution, but I'm not able to find a way to make it work. The problem is that I can't perform operations in subsets of rows using rows from outside the subset.

EDIT: I forgot to include the expected result:

  ID   |   EO_1      |     EO_2      |       EO_3      | GROUP
ID_001 |  0.50000000 |   1.20000000  |      0.60000000 |   A  
ID_002 |  0.60000000 |   0.43200000  |      0.25920000 |   A
ID_003 |  0.25920000 |   0.02902376  |      0.00752296 |   A
ID_004 |  0.00752296 |   0.00000164  |      0.00000001 |   A
ID_001 |  0.40000000 |   2.50000000  |      1.00000000 |   B
ID_002 |  1.00000000 |   2.50000000  |      2.50000000 |   B
ID_003 |  2.50000000 |  15.62500000  |     39.06250000 |   B
ID_004 | 39.06250000 | 23841.8580000 | 931322.57810000 |   B   
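These values follow from applying the three rules in order, one row at a time, within each group. As a minimal base R sketch of that recurrence for a single group (the helper name `fill_group` is hypothetical and only used for illustration; it is not part of data.table):

```r
# Hypothetical helper: fills EO_1, EO_2 and EO_3 for ONE group,
# assuming only the first element of eo1 and eo2 is known.
fill_group <- function(eo1, eo2) {
  n <- length(eo1)
  eo3 <- numeric(n)
  eo3[1] <- eo1[1] * eo2[1]                      # first row: same-row values only
  for (i in seq_len(n)[-1]) {                    # rows 2..n, in order
    eo1[i] <- eo1[i - 1] * eo2[i - 1]            # previous EO_1 and EO_2
    eo2[i] <- eo1[i] * eo2[i - 1] * eo3[i - 1]   # current EO_1, previous EO_2 and EO_3
    eo3[i] <- eo1[i] * eo2[i]                    # current EO_1 and EO_2
  }
  list(EO_1 = eo1, EO_2 = eo2, EO_3 = eo3)
}

# Group A from the table above: only the first row is known
res <- fill_group(c(0.5, NA, NA, NA), c(1.2, NA, NA, NA))
```

For the second row of group A this gives EO_1 = 0.6, EO_2 = 0.432 and EO_3 = 0.2592, which is the behaviour I'm trying to reproduce inside data.table.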

NEW EDIT I came up with the following snippet, but I would rather wait a bit to see if someone can get a more efficient solution than this one:

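# Repeat until every NA has been filled in; each pass fills a bit more
while (any(is.na(dt))) {
  dt[, `:=` (
    EO_3 = calc_EO_3(EO_1, EO_2),
    EO_1 = ifelse(ID == "ID_001", EO_1, calc_EO_1(EO_1, EO_2)),
    EO_2 = ifelse(ID == "ID_001", EO_2, calc_EO_2(EO_1, EO_2, EO_3))
  )]
}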

I've come up with a similar dplyr solution, with that ugly while-loop fix as well. The key would be to find a way to make a row-wise calculation that can get info from the previous row, even when that previous row is outside the selected subset. I hope someone can improve on this, so I'll wait a little before marking it as a solution.

sneaky_lobster asked Jun 13 '19 17:06


2 Answers

Here is another possible approach:

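# Seed EO_3 on the rows where EO_1 is already known (first row of each group)
dt[!is.na(EO_1), EO_3 := EO_1 * EO_2, by=.(GROUP)]
dt[ID!="ID_001", c("EO_1", "EO_2", "EO_3") :=
    dt[, {
            # Start from the first row of the group
            eo1 <- EO_1[1L]; eo2 <- EO_2[1L]; eo3 <- EO_3[1L]
            # Walk the remaining rows one ID at a time, carrying values forward
            .SD[ID!="ID_001", {
                    eo1 <- eo1 * eo2
                    eo2 <- eo1 * eo2 * eo3
                    eo3 <- eo1 * eo2
                    .(eo1, eo2, eo3)
                },
                by=.(ID)]
        },
        by=.(GROUP)][, -1L:-2L]
]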


       ID        EO_1         EO_2         EO_3 GROUP
1: ID_001  0.50000000 1.200000e+00 6.000000e-01     A
2: ID_002  0.60000000 4.320000e-01 2.592000e-01     A
3: ID_003  0.25920000 2.902376e-02 7.522960e-03     A
4: ID_004  0.00752296 1.642598e-06 1.235720e-08     A
5: ID_001  0.40000000 2.500000e+00 1.000000e+00     B
6: ID_002  1.00000000 2.500000e+00 2.500000e+00     B
7: ID_003  2.50000000 1.562500e+01 3.906250e+01     B
8: ID_004 39.06250000 2.384186e+04 9.313226e+05     B
chinsoon12 answered Sep 18 '22 00:09


Is this the kind of data you'd expect the end product to look like?

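go <- function(x, y, n) {
  z <- x * y
  # Prepend each new row, so the latest values are always at position 1
  for (i in seq_len(n - 1)) {
    x <- c(x[1] * y[1], x)
    y <- c(x[1] * y[1] * z[1], y)
    z <- x * y
  }
  # Reverse to chronological order and round for display
  data.table(EO_1 = x, EO_2 = y, EO_3 = z)[.N:1][, lapply(.SD, round, 8)]
}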

go(.5, 1.2, 4)

         EO_1       EO_2       EO_3
1: 0.50000000 1.20000000 0.60000000
2: 0.60000000 0.43200000 0.25920000
3: 0.25920000 0.02902376 0.00752296
4: 0.00752296 0.00000164 0.00000001
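If so, the same recurrence could be applied to every GROUP of the original table by seeding it with the first row of each group. A minimal sketch, assuming a table shaped like the one in the question (rebuilt here with `data.table()` so the snippet is self-contained); the helper name `fill_from_seed` is hypothetical:

```r
library(data.table)

# Toy data with the same shape as the question's table
dt <- data.table(
  ID    = rep(sprintf("ID_%03d", 1:4), times = 2),
  EO_1  = c(0.5, NA, NA, NA, 0.4, NA, NA, NA),
  EO_2  = c(1.2, NA, NA, NA, 2.5, NA, NA, NA),
  EO_3  = NA_real_,
  GROUP = rep(c("A", "B"), each = 4)
)

# Hypothetical helper: same recurrence, but appending instead of prepending,
# so the result is already in row order. x and y are the seed values.
fill_from_seed <- function(x, y, n) {
  z <- x * y
  for (i in seq_len(n - 1)) {
    x <- c(x, x[i] * y[i])              # EO_1 from the previous row
    y <- c(y, x[i + 1] * y[i] * z[i])   # EO_2 from current EO_1, previous EO_2 and EO_3
    z <- c(z, x[i + 1] * y[i + 1])      # EO_3 from the current row
  }
  list(x, y, z)
}

# Overwrite the three columns group by group, seeded by each group's first row
dt[, c("EO_1", "EO_2", "EO_3") := fill_from_seed(EO_1[1], EO_2[1], .N), by = GROUP]
```

This is not rounded, and it assumes the first row of each group holds the known values of EO_1 and EO_2.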
James B answered Sep 18 '22 00:09
