How can I use data.table syntax to produce a data.table where each column contains the differences between the column of the original data.table and the next column?
Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:
pop <- data.table(group_id = c(1, 2, 3),
N = c(4588L, 4589L, 4589L),
N_surv_1 = c(4213, 4243, 4264),
N_surv_2 = c(3703, 3766, 3820),
N_surv_3 = c(2953, 3054, 3159) )
# group_id N N_surv_1 N_surv_2 N_surv_3
# 1 4588 4213 3703 2953
# 2 4589 4243 3766 3054
# 3 4589 4264 3820 3159
(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)
What I have done: using the base diff
and matrix transposition, we can:
diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff)
# produces desired output:
# group_id deaths_1 deaths_2 deaths_3
# 1 -375 -510 -750
# 2 -346 -477 -712
# 3 -325 -444 -661
I know that I can use base diff
by group on a single column produced by melt.data.table
, so this works but ain't pretty:
melt(pop,
id.vars = "group_id"
)[order(group_id)][, setNames(as.list(diff(value)),
paste0("deaths_",1:(ncol(pop)-2)) ),
keyby = group_id]
Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?
Well, you could subtract the subsets:
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
# N_surv_1 N_surv_2 N_surv_3
# 1: -375 -510 -750
# 2: -346 -477 -712
# 3: -325 -444 -661
You could assign these values to new columns with :=
. I have no idea why tail
and head
are not made more easily available... As pointed out by @akrun, you could use with=FALSE
instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols]
.
Anyway, this is pretty convoluted compared to simply reshaping:
melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
# group_id V1
# 1: 1 -375
# 2: 1 -510
# 3: 1 -750
# 4: 2 -346
# 5: 2 -477
# 6: 2 -712
# 7: 3 -325
# 8: 3 -444
# 9: 3 -661
Without reshaping data and each row with a unique id, you can group by the id column and then calculate the difference with diff
on each row, i.e. unlist(.SD)
:
pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]
# group_id deaths_1 deaths_2 deaths_3
# 1: 1 -375 -510 -750
# 2: 2 -346 -477 -712
# 3: 3 -325 -444 -661
Essentially, something like this if you ignore setting up the column names:
pop[, as.list(diff(unlist(.SD))), group_id]
Here's another way to do it without reshaping or grouping which might make it faster. If it's small number of rows then it probably won't be a noticeable difference.
cols<-names(pop)[-1]
combs<-list()
for(i in 2:length(cols)) {
combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
}
newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
deathpop<-copy(pop)
deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
deathpop[,(cols):=NULL]
I did some benchmarking
rows<-10000000
pop <- data.table(group_id = 1:rows,
N = runif(rows,3000,4000),
N_surv_1 = runif(rows,3000,4000),
N_surv_2 = runif(rows,3000,4000),
N_surv_3 = runif(rows,3000,4000))
system.time({
cols<-names(pop)[-1]
combs<-list()
for(i in 2:length(cols)) {
combs[[length(combs)+1]]<-c(cols[i-1], cols[i])
}
newnames<-sapply(combs,function(x) gsub('N_surv','death',x[2]))
deathpop<-copy(pop)
deathpop[,(newnames):=lapply(combs,function(x) get(x[2])-get(x[1]))]
deathpop[,(cols):=NULL]})
and it returned
user system elapsed
0.192 0.808 1.003
In contrast I did
system.time(pop[, as.list(diff(unlist(.SD))), group_id])
and it returned
user system elapsed
169.836 0.428 170.469
I also did
system.time({
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
})
which returned
user system elapsed
0.044 0.044 0.089
Finally, doing
system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])
returns
user system elapsed
223.360 1.736 225.315
Frank's Map
solution is fastest. If you take the copying out of mine then it gets a lot closer to Frank's time but his still wins for this test case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With