I am looking for a more efficient way of reshaping data.table data in R.
At the moment I am looping through to carry out a re-shape of multiple time series predictions.
I get the correct answer that I am after; however, I feel that the methodology is very inelegant (and not very data.table-like). Therefore I am looking to the SO community to see if there is a more elegant solution.
Please see below for data setup and also two attempts at getting to the desired answer.
# load libraries
require(data.table)
require(lubridate)
# set up data assumptions
id_vec <- letters
len_id_vec <- length(id_vec)
num_orig_dates <- 7
set.seed(123)
# create original data frame
orig <- data.table(ID=rep(id_vec,each=num_orig_dates),
date=rep(c(Sys.Date() %m+% months(0: (num_orig_dates-1))),times=len_id_vec),
most_recent_bal=unlist(lapply(round(runif(len_id_vec)*100),function(y){
y*cumprod(1+rnorm(num_orig_dates,0.001,0.002))})))
# add 24 months ahead predictions of balances using a random walk from the original dates
nrow_orig <- nrow(orig)
for(i in seq(24)){
orig[,paste0('pred',i,'_bal'):=most_recent_bal*(1+rnorm(nrow_orig,0.001,0.003))]
orig[,paste0('pred',i,'_date'):=date %m+% months(i)]
}
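For reference, a quick (purely illustrative) look at the wide structure this produces:
dim(orig)        # 182 rows (26 IDs x 7 original dates) by 51 columns
names(orig)[1:6] # "ID" "date" "most_recent_bal" "pred1_bal" "pred1_date" "pred2_bal"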
# First attempt
t0 <- Sys.time()
tmp1 <- rbindlist(lapply(unique(orig$ID),function(x){
orig1 <- orig[ID==x,]
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
# Go through each original date to realign predicted date and predicted balance
date_vec <- orig1$date
tmp <- rbindlist(lapply(date_vec,function(y){
tmp <- data.table(dates=as.Date(as.vector(t(orig1[date==y,date_cols,with=FALSE]))),
bals=as.vector(t(orig1[date==y,bal_cols,with=FALSE])))
tmp[,type:='prediction']
tmp[,date_prediction_run:=y]
# collect historical information too for plotting purposes.
tmp1 <- orig1[date<=y,c('date','most_recent_bal'),with=FALSE]
if(nrow(tmp1)!=0){
setnames(tmp1,c('date','most_recent_bal'),c('dates','bals'))
tmp1[,type:='history']
tmp1[,date_prediction_run:=y]
tmp <- rbind(tmp,tmp1)
}
tmp
}))
tmp[,ID:=x]
}))
t1 <- Sys.time()
t1-t0 #Time difference of 1.117216 secs
# Second Attempt: a slightly more data.table way which is faster but still very inelegant....
t2 <- Sys.time()
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
tmp1a <- rbindlist(lapply(unique(orig$ID),function(x){
tmp <- cbind(melt(orig[ID==x,c('date',bal_cols),with=FALSE],
measure.vars = bal_cols,value.name='bals')[,-('variable'),with=FALSE],
data.table(dates=melt(orig[ID==x,date_cols,with=FALSE],
measure.vars = date_cols)[,value]))
setnames(tmp,'date','date_prediction_run')
tmp[,type:='prediction']
tmp1 <- orig[ID==x,orig[ID==x & date<=.BY,c('date','most_recent_bal'),with=FALSE],by=date]
setnames(tmp1,c('date_prediction_run','dates','bals'))
tmp1[,type:='history']
setcolorder(tmp1,colnames(tmp1)[match(colnames(tmp),colnames(tmp1))])
tmp <- rbind(tmp,tmp1)
tmp[,ID:=x]
tmp
}))
t3 <- Sys.time()
t3-t2 # Time difference of 0.2309799 secs
Here is a data.table solution, presented step-by-step, i.e. breaking down what we are doing and, just for this first pass, keeping it readable. N.B. Afterwards, below (in an update to follow shortly), I'll optimize the solution somewhat by pulling everything together, e.g. by combining steps, chaining, in-place assignments, etc. The more optimized solution will, as you might expect, be far less readable without first understanding the step-by-step version presented here, which aims to show people learning data.table how they might arrive at a solution.
# First Pass = Step-by-step (not optimized) just first work out a solution
library(data.table)
# Transform prediction data from `orig` data.table into long format
# i.e. by melting pred#_bal and pred#_date columns
pred_data <-
data.table::melt( orig,
measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
value.name = c("bals", "dates"))
pred_data[, type := "prediction"] # add the 'type' column to pred_data (all are type="prediction")
# select desired columns in order
pred_data <- pred_data[, .( dates, bals, type, date_prediction_run = date, ID)]
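(Side note: if melting two groups of columns at once with patterns() is new to you, here is a minimal, self-contained sketch on a toy table; the names w, bals and dates are illustrative only, not part of the solution.)
# toy wide table: two balance columns and two matching date columns
w <- data.table(ID         = c("a", "b"),
                pred1_bal  = c(1, 2),
                pred2_bal  = c(3, 4),
                pred1_date = as.Date(c("2021-02-01", "2021-02-01")),
                pred2_date = as.Date(c("2021-03-01", "2021-03-01")))
# patterns() defines two measure groups; each group melts into its own value column,
# so every output row pairs a balance with its corresponding date
melt(w,
     measure.vars = patterns("_bal$", "_date$"),
     value.name   = c("bals", "dates"))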
# Collect historical information from the most_recent_bal column,
# which the OP wants for plotting purposes
graph_data <-
orig[ orig,
.(ID, dates=date, bals=i.most_recent_bal, date_prediction_run=x.date),
on=.(ID, date>=date)]
graph_data[, type := "history"] # these are all type="history"
# final output, combining the prediction data and the graph data:
output <- rbindlist(list(pred_data, graph_data), use.names=TRUE)
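If the non-equi self-join above looks cryptic, here is a minimal sketch of the same idea on a toy table (the names dt, d and bal are illustrative only, not part of the solution):
# one ID, three monthly dates, one balance per date
dt <- data.table(ID  = "a",
                 d   = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01")),
                 bal = c(100, 110, 120))
# For every row of i (each date, treated as a history date), keep the x rows of the
# same ID whose date is on or after it (those play the role of prediction-run dates)
dt[dt,
   .(ID, dates = i.d, bals = i.bal, date_prediction_run = x.d),
   on = .(ID, d >= d)]
# In the solution above, the unprefixed `date` inside j plays the role of `i.d` here:
# in a non-equi join, x's join column is returned with i's values, while `x.date`
# gives x's own (later) value.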
Below is my "first pass at optimizing by combining some steps and chaining". However, even though I have combined some steps, used chaining, and it looks nice and short, the code below is no faster than the original step-by-step solution above, as I'll show at the end of the post with benchmark timings. I'm leaving the code in because it illustrates a good point and presents a learning opportunity.
First pass at optimizing by combining some steps and chaining [not faster!]
library(data.table)
# Transform prediction data into long format
# by melting pred#_bal and pred#_date columns
pred_data <-
data.table::melt( orig[, type := "prediction"], #add the type column to orig, before melting
measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
value.name = c("bals", "dates")
)[, .( dates, bals, type, date_prediction_run = date, ID)] # chain, to select desired columns in order
# FINAL RESULT: rbindlist pred_data to historic data
pred_data <-
rbindlist( list( pred_data, orig[ orig[, type := "history"],
.(dates=date, bals=i.most_recent_bal, type, date_prediction_run=x.date, ID),
on=.(ID, date>=date)]
),
use.names=TRUE)
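(A quick sanity check, assuming you ran the step-by-step solution above in the same session so that output is still in the workspace: the combined result should match it up to row order.)
all.equal(output[   order(ID, date_prediction_run, type, dates)],
          pred_data[order(ID, date_prediction_run, type, dates)])  # should return TRUE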
Testing timings using the very handy microbenchmark package:
Unit: milliseconds
expr min lq mean median uq max neval
h.l.m_first_attempt 1140.017957 1190.818176 1249.499493 1248.977454 1299.497679 1427.632140 100
h.l.m_second_attempt 231.380930 239.513223 254.702865 249.735005 262.516276 375.762675 100
krads_step.by.step 2.855509 2.985509 3.289648 3.059481 3.269429 6.568006 100
krads_optimized 2.909343 3.073837 3.555803 3.150584 3.554100 12.521439 100
The benchmark results show the data.table solutions are huge timing improvements over the OP's solutions. Great, that's what was asked for: we've shown how awesomely fast data.table can be, but I hope also how simple and readable it can be too!
However, don't miss another lesson here:
Looking at the microbenchmark results, note how both of my solutions have effectively the same mean time. At first that might not make sense: why is my "step-by-step" solution, with so many more lines of code, effectively just as fast as my attempted "optimized" solution?
Answer: if you look closely, all the same steps appear in both of my solutions. In my "optimized" solution, yes, we're chaining, and you might at first think we're doing fewer assignments than the "step-by-step" version literally spells out. But, as the benchmark results should tell you, we have NOT done fewer assignments! I.e. at each point where we use [] to "chain" on another operation, it is literally equivalent to assigning back to your original DT with <-.
If you can wrap your head around that you'll be on your way to better programming: you can confidently skip the "chaining" and instead use <- to spell out a step-by-step (more readable, easier to debug and more maintainable) solution, as the short sketch below illustrates.
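To make that concrete, here is a minimal sketch on a toy table (dt1 and dt2 are illustrative names, not from the solutions above):
dt1 <- data.table(a = 1:3)
dt2 <- data.table(a = 1:3)
# "Chained": two operations written as one expression
res_chained <- dt1[, b := a * 2][a > 1]
# "Step-by-step": the very same two operations, spelled out with <-
dt2 <- dt2[, b := a * 2]
res_steps <- dt2[a > 1]
all.equal(res_chained, res_steps)  # TRUE -- chaining saves keystrokes, not work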
Where you can save time, it comes down to finding places where you can avoid assigning multiple times unnecessarily inside a loop or apply operation. But that's a topic for another post, I think!
N.B. In case you want to use microbenchmark on your own code, all I did was this:
library(microbenchmark)
mbm <- microbenchmark(
h.l.m_first_attempt = {
# Pasted in h.l.m's first solution, here
},
h.l.m_second_attempt = {
# Pasted in h.l.m's second solution, here
},
krads_step.by.step = {
# Pasted in my first solution, here
},
krads_optimized = {
# Pasted in my second solution, here
},
times = 100L
)
mbm
If you want a graph, follow with:
library(ggplot2)
autoplot(mbm)