Take column-wise differences across a data.table

How can I use data.table syntax to produce a data.table in which each column contains the differences between a column of the original data.table and the next column?

Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:

pop <- data.table(group_id = c(1, 2, 3), 
                   N = c(4588L, 4589L, 4589L), 
                   N_surv_1 = c(4213, 4243, 4264), 
                   N_surv_2 = c(3703, 3766, 3820), 
                   N_surv_3 = c(2953, 3054, 3159) )
# group_id    N N_surv_1 N_surv_2 N_surv_3
#        1 4588     4213     3703     2953
#        2 4589     4243     3766     3054
#        3 4589     4264     3820     3159

(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)

What I have done: using base diff and matrix transposition, I can get there:

diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff) 
# produces desired output:
#    group_id deaths_1 deaths_2 deaths_3
#           1     -375     -510     -750
#           2     -346     -477     -712
#           3     -325     -444     -661

I know that I can use base diff by group on the single value column produced by melt.data.table, so this works, but it isn't pretty:

melt(pop, 
     id.vars = "group_id"
     )[order(group_id)][, setNames(as.list(diff(value)),
                                   paste0("deaths_",1:(ncol(pop)-2)) ),
                          keyby = group_id]

Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?

asked Aug 17 '16 by C8H10N4O2

3 Answers

Well, you could subtract the subsets:

ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
  `-`, 
   utils:::tail.default(.SD, -1), 
   utils:::head.default(.SD, -1)
), .SDcols=ncols]

#    N_surv_1 N_surv_2 N_surv_3
# 1:     -375     -510     -750
# 2:     -346     -477     -712
# 3:     -325     -444     -661

You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by @akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
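For instance, assigning the differences as new columns with `:=` might look like this (a sketch of my own; the deaths_* names are my choice, not part of the answer):

```r
library(data.table)

pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))

ncols <- grep("^N(_surv_[0-9]+)?", names(pop), value = TRUE)

# add one deaths_* column per consecutive pair of population columns
pop[, paste0("deaths_", seq_len(length(ncols) - 1L)) := Map(
  `-`,
  utils:::tail.default(.SD, -1),
  utils:::head.default(.SD, -1)
), .SDcols = ncols]
```

(The reason `utils:::tail.default` behaves column-wise here is that code outside data.table-aware namespaces falls back to data.frame subsetting semantics, where `x[2:4]` selects columns.)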

Anyway, this is pretty convoluted compared to simply reshaping:

melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
#    group_id   V1
# 1:        1 -375
# 2:        1 -510
# 3:        1 -750
# 4:        2 -346
# 5:        2 -477
# 6:        2 -712
# 7:        3 -325
# 8:        3 -444
# 9:        3 -661
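If the wide layout is wanted back, one option (my own sketch, not from the answer; the deaths_* labels are mine) is to number the differences within each group and dcast:

```r
library(data.table)

pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))

# long-format differences, numbered within each group
long <- melt(pop, id = "group_id")[
  , .(year = seq_len(.N - 1L), deaths = tail(value, -1) - head(value, -1))
  , by = group_id]
long[, label := paste0("deaths_", year)]

# back to one row per group
wide <- dcast(long, group_id ~ label, value.var = "deaths")
```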
answered Oct 21 '22 by Frank

Without reshaping the data, and since each row has a unique id, you can group by the id column and then compute the differences with diff on each row, i.e. on unlist(.SD):

pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]

#    group_id deaths_1 deaths_2 deaths_3
# 1:        1     -375     -510     -750
# 2:        2     -346     -477     -712
# 3:        3     -325     -444     -661

Essentially, something like this if you ignore setting up the column names:

pop[, as.list(diff(unlist(.SD))), group_id]
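If you'd rather not rely on the grouping column being excluded from .SD, you could pick the measured columns explicitly with .SDcols, a defensive variant of the same idea (my sketch; patterns() in .SDcols needs data.table >= 1.12.0):

```r
library(data.table)

pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))

# same row-wise diff, but with the population columns selected explicitly
res <- pop[, as.list(diff(unlist(.SD))), by = group_id, .SDcols = patterns("^N")]
```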
answered Oct 21 '22 by Psidom


Here's another way to do it without reshaping or grouping, which might make it faster. With a small number of rows the difference probably won't be noticeable.

cols<-names(pop)[-1]
combs<-list()
for(i in 2:length(cols)) {
  combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
}
newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
deathpop<-copy(pop)
deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
deathpop[,(cols):=NULL]
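The loop that builds combs can also be collapsed with mget and Map inside j, a more compact sketch of the same computation (my rewrite, not the answerer's code):

```r
library(data.table)

pop <- data.table(group_id = c(1, 2, 3),
                  N = c(4588L, 4589L, 4589L),
                  N_surv_1 = c(4213, 4243, 4264),
                  N_surv_2 = c(3703, 3766, 3820),
                  N_surv_3 = c(2953, 3054, 3159))

cols <- names(pop)[-1]
newnames <- gsub("N_surv", "death", cols[-1])  # death_1, death_2, death_3

deathpop <- copy(pop)
deathpop[, (newnames) := {
  v <- mget(cols)                  # the population columns, as a list
  Map(`-`, v[-1], v[-length(v)])   # consecutive pairwise differences
}]
deathpop[, (cols) := NULL]
```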

I did some benchmarking:

rows<-10000000
pop <- data.table(group_id = 1:rows, 
                  N = runif(rows,3000,4000), 
                  N_surv_1 = runif(rows,3000,4000), 
                  N_surv_2 = runif(rows,3000,4000), 
                  N_surv_3 = runif(rows,3000,4000))
system.time({
    cols<-names(pop)[-1]
    combs<-list()
    for(i in 2:length(cols)) {
      combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
    }
    newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
    deathpop<-copy(pop)
    deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
    deathpop[,(cols):=NULL]})

and it returned

   user  system elapsed 
  0.192   0.808   1.003 

In contrast I did

system.time(pop[, as.list(diff(unlist(.SD))), group_id])

and it returned

   user  system elapsed 
169.836   0.428 170.469 

I also did

system.time({
  ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
  pop[, Map(
    `-`, 
    utils:::tail.default(.SD, -1), 
    utils:::head.default(.SD, -1)
  ), .SDcols=ncols]
})

which returned

   user  system elapsed 
  0.044   0.044   0.089 

Finally, doing

system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])

returns

   user  system elapsed 
223.360   1.736 225.315 

Frank's Map solution is fastest. If you take the copying out of mine then it gets a lot closer to Frank's time but his still wins for this test case.

answered Oct 21 '22 by Dean MacGregor