 

reshaping prediction data efficiently using data.table in R

Tags:

r

data.table

I am looking for a more efficient way of reshaping data.table data in R.

At the moment I am looping through the data to carry out a reshape of multiple time series predictions.

I get the correct answer that I am after; however, the methodology feels very inelegant and un-data.table-like. Therefore I am looking to the SO community to see if there is a more elegant solution.

Please see below for data setup and also two attempts at getting to the desired answer.

# load libraries
require(data.table)
require(lubridate)


# set up data assumptions
id_vec <- letters
len_id_vec <- length(id_vec)
num_orig_dates <- 7
set.seed(123)


# create original data frame: one row per ID per original date
orig <- data.table(ID=rep(id_vec,each=num_orig_dates),  # 26 IDs, 7 dates each
                   date=rep(c(Sys.Date() %m+% months(0:(num_orig_dates-1))),times=len_id_vec),  # 7 consecutive monthly dates per ID
                   most_recent_bal=unlist(lapply(round(runif(len_id_vec)*100),function(y){
                     y*cumprod(1+rnorm(num_orig_dates,0.001,0.002))})))  # random-walk balance history per ID


# add 24 months ahead predictions of balances using a random walk from the original dates
nrow_orig <- nrow(orig)

for(i in seq(24)){
  orig[,paste0('pred',i,'_bal'):=most_recent_bal*(1+rnorm(nrow_orig,0.001,0.003))]
  orig[,paste0('pred',i,'_date'):=date %m+% months(i)]
}
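
# For reference, orig is now a wide table: one row per ID per original date, with
# the current balance plus 24 (bal, date) prediction pairs alongside. A quick shape
# check (the numbers follow from the sizes set above):
dim(orig)  # 182 x 51 -- 26 IDs * 7 dates = 182 rows; 3 base columns + 24*2 prediction columns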


# First attempt
t0 <- Sys.time()
tmp1 <- rbindlist(lapply(unique(orig$ID),function(x){
  orig1 <- orig[ID==x,]

  bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
  date_cols <- c('date',paste0('pred',seq(24),'_date'))

  # Go through each original date to realign predicted date and predicted balance
  date_vec <- orig1$date
  tmp <- rbindlist(lapply(date_vec,function(y){

    tmp <- data.table(dates=as.Date(as.vector(t(orig1[date==y,date_cols,with=FALSE]))),
                      bals=as.vector(t(orig1[date==y,bal_cols,with=FALSE])))
    tmp[,type:='prediction']
    tmp[,date_prediction_run:=y]

    # collect historical information too, for plotting purposes.
    tmp1 <- orig1[date<=y,c('date','most_recent_bal'),with=FALSE]
    if(nrow(tmp1)!=0){

      setnames(tmp1,c('date','most_recent_bal'),c('dates','bals'))
      tmp1[,type:='history']
      tmp1[,date_prediction_run:=y]

      tmp <- rbind(tmp,tmp1)

    }

    tmp
  }))
  tmp[,ID:=x]
}))
t1 <- Sys.time()
t1-t0 #Time difference of 1.117216 secs

# Second Attempt: a slightly more data.table way which is faster but still very inelegant....
t2 <- Sys.time()
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
tmp1a <- rbindlist(lapply(unique(orig$ID),function(x){
  tmp <- cbind(melt(orig[ID==x,c('date',bal_cols),with=FALSE],
                    measure.vars = bal_cols,value.name='bals')[,-('variable'),with=FALSE],
               data.table(dates=melt(orig[ID==x,date_cols,with=FALSE],
                                     measure.vars = date_cols)[,value]))
  setnames(tmp,'date','date_prediction_run')
  tmp[,type:='prediction']

  tmp1 <- orig[ID==x,orig[ID==x & date<=.BY,c('date','most_recent_bal'),with=FALSE],by=date]
  setnames(tmp1,c('date_prediction_run','dates','bals'))
  tmp1[,type:='history']
  setcolorder(tmp1,colnames(tmp1)[match(colnames(tmp),colnames(tmp1))])
  tmp <- rbind(tmp,tmp1)
  tmp[,ID:=x]
  tmp
}))
t3 <- Sys.time()
t3-t2 # Time difference of 0.2309799 secs
asked Aug 30 '18 by h.l.m


1 Answer

Using data.table per the OP's request.

First, just showing how to build a data.table solution step-by-step,

i.e. to break down what we are doing and, for this first pass, stay readable. N.B. Afterwards, in an update below, I'll optimize the solution somewhat by pulling everything together, e.g. by combining steps, chaining, and in-place assignments. The more optimized solution will, as you might expect, be far less readable without first understanding the step-by-step version presented here, whose aim is to show people learning data.table how they might arrive at a solution.

# First Pass = Step-by-step (not optimized) just first work out a solution 

library(data.table)

# Transform prediction data from `orig` data.table into long format
# i.e. by melting pred#_bal and pred#_date columns
pred_data <- 
  data.table::melt( orig, 
                    measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),  
                    value.name =     c("bals",           "dates"))  # melted pred#_date values are the predicted dates

pred_data[, type := "prediction"]  # add the 'type' column to pred_data (all are type="prediction")

# select desired columns in order
pred_data <- pred_data[, .(dates, bals, type, date_prediction_run = date, ID)]  # 'date' is the date each prediction was run
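
# For anyone new to melting several column groups at once: patterns() takes one
# regular expression per group of measure columns, and value.name names each
# stacked result column. A tiny toy sketch (made-up column names, illustrative only):
toy <- data.table(id = c("x", "y"),
                  p1_bal = c(10, 20), p1_date = as.Date("2020-01-01"),
                  p2_bal = c(11, 21), p2_date = as.Date("2020-02-01"))
melt(toy, measure = patterns("_bal$", "_date$"), value.name = c("bal", "date"))
# -> 4 rows with columns id, variable (1 or 2), bal, date: the two groups stack in parallel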


# Collect historical information from the most_recent_bal column, 
# which the OP wants for plotting purposes

graph_data <- 
  orig[ orig, 
        .(ID, dates=i.date, bals=i.most_recent_bal, date_prediction_run=x.date),
        on=.(ID, date>=date)]  # x.date >= i.date: x.date is the run date; i.date/i.most_recent_bal are the history

graph_data[, type := "history"]  # these are all type="history" 
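
# A subtlety worth pausing on in the join above: in a non-equi join, referring to
# the join column by its plain name inside j returns i's values, so explicit x. and
# i. prefixes are the safest way to say which side you mean. A tiny self-contained
# sketch (made-up names, purely illustrative):
dt <- data.table(grp = "a", d = as.Date("2020-01-01") + 0:2, bal = c(100, 110, 120))
dt[dt, .(run_date = x.d, hist_date = i.d, hist_bal = i.bal), on = .(grp, d >= d)]
# -> 6 rows: each run date paired with every earlier-or-equal date and its balance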

# final output, combining the prediction data and the graph data:
output <- rbindlist(list(pred_data, graph_data), use.names=TRUE)
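
Since the OP wants this long format for plotting, here is a hedged sketch of how it could be used (assumes ggplot2 is installed; purely illustrative, not part of the solution):

library(ggplot2)
# one panel per prediction run: history up to the run date, predictions beyond it
ggplot(output[ID == "a"], aes(x = dates, y = bals, colour = type, group = type)) +
  geom_line() +
  facet_wrap(~ date_prediction_run)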

UPDATE 3 = IMPORTANT NOTE: The code below does nothing to improve speed!

Below is my "first pass at optimizing by combining some steps and chaining". However, even though it combines some steps, uses chaining, and looks nice and short, it is no faster than the original step-by-step solution above, as I'll show at the end of the post with benchmark timings. I'm leaving the code in because it illustrates a good point and presents a learning opportunity.

First pass at optimizing by combining some steps and chaining [not faster!]
library(data.table)

# Transform prediction data into long format
# by melting pred#_bal and pred#_date columns
pred_data <- 
  data.table::melt( orig[, type := "prediction"],  # add the type column to orig, before melting 
                    measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),  
                    value.name =     c("bals",           "dates")
                  )[, .(dates, bals, type, date_prediction_run=date, ID)] # chain, to select desired columns in order


# FINAL RESULT:  rbindlist pred_data to historic data
pred_data <- 
  rbindlist( list( pred_data, orig[ orig[, type := "history"],  
                                    .(dates=i.date, bals=i.most_recent_bal, type, date_prediction_run=x.date, ID),
                                    on=.(ID, date>=date)]
                 ), 
             use.names=TRUE)

Continuing UPDATE 3:

Testing timings using the very handy microbenchmark package:

Unit: milliseconds
                expr         min          lq        mean      median          uq         max neval
 h.l.m_first_attempt 1140.017957 1190.818176 1249.499493 1248.977454 1299.497679 1427.632140   100
h.l.m_second_attempt  231.380930  239.513223  254.702865  249.735005  262.516276  375.762675   100
  krads_step.by.step    2.855509    2.985509    3.289648    3.059481    3.269429    6.568006   100
     krads_optimized    2.909343    3.073837    3.555803    3.150584    3.554100   12.521439   100
The benchmark results show the data.table solutions are huge timing improvements over the OP's solutions. Great, that's what was asked for: we've shown how awesomely fast data.table can be, but I hope also how simple and readable it can be too! However, don't miss another lesson here:

Looking at the microbenchmark results, note how both of my solutions have effectively the same mean time. At first that might not make sense: why is my "step-by-step" solution, with so many more lines of code, effectively just as fast as my attempted "optimized" solution?

Answer: If you look closely, all the same steps appear in both of my solutions. In my "optimized" solution, yes, we're chaining, and you might at first think we're doing fewer assignments than the "step-by-step" version literally spells out. But, as the benchmark results should tell you, we have NOT done fewer assignments! I.e. at each point where we use [] to "chain" together another operation, it is effectively equivalent to assigning the intermediate result to a variable with <-.

If you can wrap your head around that, you'll be on your way to better programming: you can confidently skip "chaining" and instead use <- to spell out a step-by-step solution that is more readable, easier to debug, and more maintainable!
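
To make that concrete, a minimal sketch (toy data, not the solution code) showing that a chain and its spelled-out equivalent do the same work:

DT <- data.table(grp = c("a", "a", "b"), x = c(1, -1, 2), y = 1:3)

res_chained <- DT[x > 0][, .(s = sum(y)), by = grp]  # chained: the first [] still builds an intermediate table

tmp <- DT[x > 0]                                     # step-by-step: the same intermediate, just named
res_steps <- tmp[, .(s = sum(y)), by = grp]

identical(res_chained, res_steps)  # TRUE -- same operations, same cost, but the named version is easier to debug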

Where you can save time, it comes down to finding places where you can avoid assigning multiple times unnecessarily in a loop or apply operation. But that's a topic for another post, I think!
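
(To give just a taste of that idea: data.table's := and set() assign by reference, so, for example, adding columns inside a loop does not copy the whole table on each iteration. A minimal sketch:)

# set() adds/overwrites columns by reference -- no copy of orig on any iteration
for (i in seq(3)) set(orig, j = paste0("scratch", i), value = i)
orig[, paste0("scratch", seq(3)) := NULL]  # remove the illustration columns again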

N.B. In case you want to use microbenchmark on your own code, all I did was this:

library(microbenchmark)
mbm <- microbenchmark(
  h.l.m_first_attempt = {
    # Pasted in h.l.m's first solution, here
  },

  h.l.m_second_attempt = {
    # Pasted in h.l.m's second solution, here
  },

  krads_step.by.step = {
    # Pasted in my first solution, here
  },

  krads_optimized = {
    # Pasted in my second solution, here
  },
  times = 100L
)
mbm

If you want a graph, follow with:

library(ggplot2)
autoplot(mbm)
answered Sep 29 '22 by krads