Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cumulative vector in data table

Tags:

r

data.table

I have the following data table:

library(data.table)
dat = data.table(j = c(3,8,9,11,10,28), gr = c(9,9,9,9,10,10))
> dat
    j gr
1:  3  9
2:  8  9
3:  9  9
4: 11  9
5: 10 10
6: 28 10

There are two groups (specified by 'gr') and they are ordered. Now what I would like to achieve is to create for each row of each group a cumulative vector of values in 'j'. The result should be a list column, like this:

res_dat = data.table(j = c(3,8,9,11,10,28), gr = c(9,9,9,9,10,10),
                     res = list(3, c(3,8), c(3,8,9), c(3,8,9,11),
                                10, c(10, 28)))
> res_dat
    j gr         res
1:  3  9           3
2:  8  9         3,8
3:  9  9       3,8,9
4: 11  9  3, 8, 9,11
5: 10 10          10
6: 28 10       10,28

I tried the following:

First I created a dummy column with the number per row for each group.

dat[, tmp:= seq_len(.N), by = gr]

My plan was to use that number in order to subset the j vector, but I have not managed. None of these worked:

dat[, res := list(j[1:tmp]), by = gr]
dat[, res := list(list(j[1:tmp])), by = gr] # based on https://stackoverflow.com/questions/22531477/using-lists-inside-data-table-columns

I get the following error:

Warning messages:
1: In 1:tmp : numerical expression has 4 elements: only the first used
2: In 1:tmp : numerical expression has 2 elements: only the first used

which does help understand how it fails but I don't know how I can make it succeed. Any ideas?

like image 365
User2321 Avatar asked Aug 15 '21 13:08

User2321


2 Answers

This is Henrik's answer (and if they come back, I'll happy give this answer to them ... somehow):

dat[, res := .(Reduce(c, j, accumulate=TRUE)), by = gr]
#        j    gr         res
#    <num> <num>      <list>
# 1:     3     9           3
# 2:     8     9         3,8
# 3:     9     9       3,8,9
# 4:    11     9  3, 8, 9,11
# 5:    10    10          10
# 6:    28    10       10,28

Reduce is similar to sapply except that it operates on the current value and results of the previous operation. For instance, we can see

sapply(1:3, function(z) z*2)
# [1] 2 4 6

This, unrolled, equates to

1*2 # 2
2*2 # 4
3*2 # 6

That is, the calculation on one element of the vector/list is completely independent, never knowing the results from previous iterations.

However, Reduce is explicitly given the results of the previous calculation. By default, it will only return the last calculation, which would be analogous to tail(sapply(...), 1):

Reduce(function(prev, this) prev + this*2, 11:13)
# [1] 61

That seems a bit obscure ... let's look at all of the interim steps, where the answer above is the last:

Reduce(function(prev, this) prev + this*2, 11:13, accumulate = TRUE)
# [1] 11 35 61

In this case (without specifying init=, wait for it), the first result is just the first value in x=, not run through the function. If we unroll this, we'll see

11        # 11 is the first value in x
   _________/
  /
 v
11 + 12*2 # 35
35 + 13*2 # 61

Sometimes we need the first value in x= to be run through the function, with a starting condition (a first-time value for prev when we don't have a previous iteration to use). For that, we can use init=; we can think of the use of init= by looking at two perfectly-equivalent calls:

Reduce(function(prev, this) prev + this*2, 11:13, accumulate = TRUE)
Reduce(function(prev, this) prev + this*2, 12:13, init = 11, accumulate = TRUE)
# [1] 11 35 61

(Without init=, Reduce will take the first element of x= and assign it to init= and remove it from x=.)

Now let's say we want the starting condition (injected "previous" value) to be 0, then we would do

Reduce(function(prev, this) prev + this*2, 11:13, init = 0, accumulate = TRUE)
# [1]  0 22 46 72


### unrolled
 0        # 0 is the init= value
   ________/
  /
 v
 0 + 11*2 # 22
22 + 12*2 # 46
46 + 13*2 # 72

Let's bring that back to this question and this data. I'll inject a browser() and change the function a little so that we can look at all intermediate values.

> dat[, res := .(Reduce(function(prev, this) { browser(); c(prev, this); }, j, accumulate=TRUE)), by = gr]
Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 2
[1] 3
Browse[2]> this
[1] 8
Browse[2]> c(prev, this)
[1] 3 8
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 3
[1] 3 8
Browse[2]> this
[1] 9
Browse[2]> c(prev, this)
[1] 3 8 9
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=9`, row 4
[1] 3 8 9
Browse[2]> this
[1] 11
Browse[2]> c(prev, this)
[1]  3  8  9 11
Browse[2]> c                                       # 'c'ontinue

Browse[2]> Called from: f(init, x[[i]])
Browse[1]> debug at #1: c(prev, this)
Browse[2]> prev                                    # group `gr=10`, row 6
[1] 10
Browse[2]> this
[1] 28
Browse[2]> c(prev, this)
[1] 10 28
Browse[2]> c                                       # 'c'ontinue

Notice how we didn't "see" rows 1 or 5, since they were the init= conditions for the reduction (the first prev value seen in each group).

Reduce can be a difficult function to visualize and work with. When I use it, I almost always pre-insert browser() into the anon-function and walk through the first three steps: the first to ensure the init= is correct, the second to make sure the anon-function is doing what I think I want with the init and next value, and the third to make sure that it continues properly. This is similar to "Proof by Deduction": the nth calc will be correct because we know the (n-1)th calc is correct.

like image 55
r2evans Avatar answered Sep 30 '22 13:09

r2evans


Here is a simple tidyverse solution, unfortunately I haven't started learning data.table but until you get a relevant solution it will do fine:

library(dplyr)
library(purrr)

dat %>%
  group_by(gr) %>%
  mutate(res = accumulate(j[-1], .init = j[1], ~ c(.x, .y)))

# A tibble: 6 x 3
# Groups:   gr [2]
      j    gr res      
  <dbl> <dbl> <list>   
1     3     9 <dbl [1]>
2     8     9 <dbl [2]>
3     9     9 <dbl [3]>
4    11     9 <dbl [4]>
5    10    10 <dbl [1]>
6    28    10 <dbl [2]>

Or in base R we could do:

do.call(rbind, lapply(unique(dat$gr), function(a) {
  tmp <- subset(dat, gr == a)
  tmp$res <- Reduce(c, tmp$j, accumulate = TRUE)
  tmp
}))

Or even this terrific yet concise suggestion by dear @Henrik:

do.call(rbind, by(dat, dat$gr, function(d){
  cbind(d, res = I(Reduce(c, d$j, accumulate=TRUE)))}))

Or even this one also by dear @Henrik:

dat$res = ave(dat$j, dat$gr, FUN = function(x) Reduce(c, x, accumulate=TRUE))
like image 29
Anoushiravan R Avatar answered Sep 30 '22 13:09

Anoushiravan R