
R :: data.table: Generate a running balance by group using previous balance and row-wise iteration

I have the following data.table, dt, in R.

dt <- fread("
id| rowids | charge | payment | balance
a |   1    |  7.1   |   0     |     
a |   2    |  1.2   |   3     |   
a |   3    |  1.7   |   1     |   
b |   1    |  8.1   |   0     |   
b |   2    |  2.5   |   4     |   
b |   3    |  2.3   |   2     |   
b |   4    |  3.2   |   1     |   
            ", 
            sep = "|",
            colClasses = c("character", "numeric", "numeric", "numeric", 
"numeric"))

The "balance" column should be computed, within each id group, as "balance <- previous.row.balance + charge - payment", where "previous.row.balance" is the value of "balance" in the previous row.

I initially underestimated the difficulty of computing the running balance. I was thinking of something like dt[, previous.row.balance := shift(balance, 1), by = id]. But R evaluates that expression on the whole column at once, and "balance" has no values yet for shift() to work on, since each balance depends on the one computed for the row before it.

I searched StackOverflow and found a similar question whose first answer greatly helped me think through the whole process. I adapted the code from that answer to my problem, and the following code generates the running balance by group.

dt[rowids == 1, balance := charge, by=.(id)]
dt[rowids != 1, balance :=
    dt[,
        {
            balance1 <- balance[1L]
            .SD[rowids != 1,
                {balance1 <-  balance1 + charge - payment
                    .(balance1)
                },
                by=.(rowids)]
        },
        by=.(id)][, -1L:-2L]
]

Here are my questions.

  1. I still cannot understand how the chained brackets in by=.(id)][, -1L:-2L] carry out the iteration. Since the code does not use shift() with by = group, I guess [, -1L:-2L] is what performs the iteration. But how? What does [, -1L:-2L] actually do here?

Sorry that I have to ask this here instead of commenting under that question. I am brand new to StackOverflow with only 1 reputation point, so I am not allowed to comment on the original answer. I would also like to upvote that answer, but I have to earn more points before I can do that.

  2. Is there any other way, using data.table and R's vectorized computation, to compute this running balance without wrapping a loop around the rows?

Any insight or thought is appreciated!

Jane Lu asked Sep 30 '19 17:09


2 Answers

Regarding your question #2:

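You can use the cumsum function (the output matches that of the code in the question). Assuming the balance before the first row of each group is zero, unrolling the recurrence shows why: each balance is the sum of all charge - payment values up to and including that row. The first row gets its own charge - payment, the second row adds the second charge - payment to that, and so on.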

dt[, balance2 := cumsum(charge - payment), by = id]


dt
#    id rowids charge payment balance balance2
# 1:  a      1    7.1       0     7.1      7.1
# 2:  a      2    1.2       3     5.3      5.3
# 3:  a      3    1.7       1     6.0      6.0
# 4:  b      1    8.1       0     8.1      8.1
# 5:  b      2    2.5       4     6.6      6.6
# 6:  b      3    2.3       2     6.9      6.9
# 7:  b      4    3.2       1     9.1      9.1
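
As a sanity check, here is a minimal sketch that rebuilds the table from the question and confirms that the cumsum result satisfies the original recurrence. The column name prev_balance is introduced only for this illustration, and the balance before the first row of each group is assumed to be zero.

```r
library(data.table)

# Same data as in the question, built directly instead of via fread()
dt <- data.table(
  id      = c("a", "a", "a", "b", "b", "b", "b"),
  rowids  = c(1, 2, 3, 1, 2, 3, 4),
  charge  = c(7.1, 1.2, 1.7, 8.1, 2.5, 2.3, 3.2),
  payment = c(0, 3, 1, 0, 4, 2, 1)
)

# Vectorized running balance per group
dt[, balance := cumsum(charge - payment), by = id]

# Once balance exists, shift() can recover the previous row's balance.
# fill = 0 encodes the assumption of a zero opening balance per group.
dt[, prev_balance := shift(balance, 1L, fill = 0), by = id]

# The recurrence from the question holds row by row
check <- dt[, all.equal(balance, prev_balance + charge - payment)]
print(check)
```

This also shows where shift() fits: it is useful after the balance has been computed, not as a way to compute it.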
IceCreamToucan answered Dec 11 '22 10:12


Since @IceCreamToucan has answered part 2 (how to improve the code), I'll just cover part 1 (why x[, -1:-2] works). From ?data.table, we know that in general the j field can be used to select columns:

When j is a vector of column names or positions to select (as in data.frame) [, then it behaves as with a data.frame].

(The words in brackets are my edit to complete the sentence.)

In particular, when j takes the form n:m, ...

  • If all of n..m are negative or zero, then the specified columns are dropped
  • If all of the n..m are positive or zero, then the specified columns are selected

You would also see this behavior with j set to -c(1,2) or !c(1,2) or !(1:2) or -(1:2).

This behavior is based on special parsing of j to check for : or ! or - being the top-level function.
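
A small sketch of this on a toy table (the table x and its column names g1, g2, v are made up for illustration). Note that in R unary minus binds more tightly than `:`, so -1L:-2L is the sequence from -1 to -2, not -(1:-2).

```r
library(data.table)

# Toy table: two leading columns to drop, one value column to keep
x <- data.table(g1 = c("a", "b"), g2 = 1:2, v = c(10, 20))

# Plain R: unary minus is applied before `:`
idx <- -1L:-2L
print(idx)               # -1 -2

# All of these drop the first two columns
kept_a <- names(x[, -1L:-2L])
kept_b <- names(x[, -(1:2)])
kept_c <- names(x[, !c(1, 2)])
print(kept_a)            # "v"

# Positive positions select instead of drop
selected <- names(x[, 1:2])
print(selected)          # "g1" "g2"
```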

Next, it is important to know that the columns in by= are placed first in the result.

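Combining these two points in the OP's example: the result has id (the outer by) as its first column and rowids (the inner by) as its second. Once [, -1L:-2L] drops those two, only the column from the .(balance1) expression remains. So [, -1L:-2L] does not perform any iteration; it only removes the grouping columns so that a single column of balances is left to assign.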
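
To see the column order concretely, here is a simplified sketch with the same nested grouping shape as the OP's code. It computes a plain per-row net amount (a hypothetical column named net) instead of the running balance, so it only illustrates where the by columns end up, not the iteration itself.

```r
library(data.table)

dt <- data.table(
  id      = c("a", "a", "a", "b", "b", "b", "b"),
  rowids  = c(1, 2, 3, 1, 2, 3, 4),
  charge  = c(7.1, 1.2, 1.7, 8.1, 2.5, 2.3, 3.2),
  payment = c(0, 3, 1, 0, 4, 2, 1)
)

# Outer grouping by id, inner grouping by rowids on .SD
res <- dt[, .SD[, .(net = charge - payment), by = .(rowids)], by = .(id)]

# Grouping columns come first: outer by, then inner by, then the j result
cols_before <- names(res)
print(cols_before)       # "id" "rowids" "net"

# Dropping the first two columns leaves only the computed column
cols_after <- names(res[, -1L:-2L])
print(cols_after)        # "net"
```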

Frank 2 answered Dec 11 '22 09:12