Time-series - data splitting and model evaluation

I've tried to use machine learning to make predictions based on time-series data. One Stack Overflow question (createTimeSlices function in CARET package in R) has an example of using createTimeSlices for cross-validation during model training and parameter tuning:

    library(caret)
    library(ggplot2)
    library(pls)
    data(economics)
    myTimeControl <- trainControl(method = "timeslice",
                                  initialWindow = 36,
                                  horizon = 12,
                                  fixedWindow = TRUE)

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics,
                        method = "pls",
                        preProc = c("center", "scale"),
                        trControl = myTimeControl)

My understanding is:

1. I need to split my data into a training set and a test set.
2. Use the training set for parameter tuning.
3. Evaluate the obtained model on the test set (using R2, RMSE, etc.)

Because my data is a time series, I suppose that I cannot use bootstrapping to split the data into training and test sets. So, my questions are: Am I right? And if so, how do I use createTimeSlices for model evaluation?

asked Jul 15 '14 12:07

Jot eN

2 Answers

Note that the code in your original question already takes care of the time slicing, so you don't have to create the time slices by hand.

However, here is how to use createTimeSlices for splitting the data and then using it for training and testing a model.

Step 0: Loading the packages and the data (from your question):

    library(caret)
    library(ggplot2)
    library(pls)

    data(economics)

Step 1: Creating the timeSlices for the index of the data:

    timeSlices <- createTimeSlices(1:nrow(economics),
                                   initialWindow = 36,
                                   horizon = 12,
                                   fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

    > str(timeSlices, max.level = 1)
    ## List of 2
    ##  $ train:List of 431
    ##   .. [list output truncated]
    ##  $ test :List of 431
    ##   .. [list output truncated]
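
To see what those 431 slices contain, here is a minimal base-R sketch that rebuilds the same index layout by hand. `make_slices` is a hypothetical helper written for illustration, not a caret function. It assumes a fixed window, so that training slice i covers rows i to i + 35 and the matching test slice covers the following 12 rows, which is how I understand `fixedWindow = TRUE` to behave.

```r
# Hypothetical helper (not part of caret): rolling-origin slices with a
# fixed-width training window, mirroring initialWindow/horizon above.
make_slices <- function(n, initialWindow, horizon) {
  starts <- seq_len(n - initialWindow - horizon + 1)
  list(
    train = lapply(starts, function(s) s:(s + initialWindow - 1)),
    test  = lapply(starts, function(s) {
      (s + initialWindow):(s + initialWindow + horizon - 1)
    })
  )
}

# 478 is the sample count shown in the train() output of this answer.
slices <- make_slices(478, initialWindow = 36, horizon = 12)

length(slices$train)       # number of slices
range(slices$train[[1]])   # first training window
range(slices$test[[1]])    # first test window
range(slices$test[[431]])  # last test window
```

Each test window starts right after its training window, so no slice is ever evaluated on observations that precede its training data.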

For ease of understanding, I am saving them in separate variables:

    trainSlices <- timeSlices[[1]]
    testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics[trainSlices[[1]], ],
                        method = "pls",
                        preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

    pred <- predict(plsFitTime, economics[testSlices[[1]], ])

Step 4: Plotting:

    true <- economics$unemploy[testSlices[[1]]]

    plot(true, col = "red", ylab = "true (red) , pred (blue)",
         ylim = range(c(pred, true)))
    points(pred, col = "blue")
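
A plot is only a qualitative check, while the question asks for RMSE and R2. Below is a minimal sketch of both, written in base R so the arithmetic is visible. The helper names `rmse` and `rsq` are illustrative, the toy vectors stand in for `pred` and `true`, and R2 is taken here as the squared correlation between predictions and observations, which is an assumption about the definition wanted. caret also provides `postResample(pred, obs)` for this purpose.

```r
# Illustrative helpers; these names are not from caret.
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rsq  <- function(pred, obs) cor(pred, obs)^2

# Toy stand-ins for `pred` and `true` from the steps above.
obs_toy  <- c(10, 12, 15, 19, 24)
pred_toy <- c(11, 12, 14, 20, 22)

rmse(pred_toy, obs_toy)  # root mean squared error
rsq(pred_toy, obs_toy)   # squared correlation
```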

You can then do this for all the slices:

    # Repeat steps 2-4 for every slice; seq_along() also copes with an empty list
    for (i in seq_along(trainSlices)) {
      plsFitTime <- train(unemploy ~ pce + pop + psavert,
                          data = economics[trainSlices[[i]], ],
                          method = "pls",
                          preProc = c("center", "scale"))
      pred <- predict(plsFitTime, economics[testSlices[[i]], ])

      true <- economics$unemploy[testSlices[[i]]]
      plot(true, col = "red", ylab = "true (red) , pred (blue)",
           main = i, ylim = range(c(pred, true)))
      points(pred, col = "blue")
    }
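
Several hundred plots are hard to read, so collecting one error figure per slice is usually easier to summarise. A minimal sketch of that pattern follows. To stay self-contained it runs on simulated data with `lm()` as a stand-in model; the data frame `toy`, its size and its variable names are assumptions for illustration. With the real data, the function body would call `train()` and `predict()` as in the loop above.

```r
set.seed(1)

# Simulated stand-in for the economics data: one predictor, one response.
n   <- 120
toy <- data.frame(x = cumsum(rnorm(n)))
toy$y <- 2 * toy$x + rnorm(n)

initialWindow <- 36
horizon       <- 12
starts        <- seq_len(n - initialWindow - horizon + 1)

# One RMSE per rolling-origin slice.
slice_rmse <- vapply(starts, function(s) {
  train_idx <- s:(s + initialWindow - 1)
  test_idx  <- (s + initialWindow):(s + initialWindow + horizon - 1)
  fit  <- lm(y ~ x, data = toy[train_idx, ])
  pred <- predict(fit, newdata = toy[test_idx, ])
  sqrt(mean((pred - toy$y[test_idx])^2))
}, numeric(1))

length(slice_rmse)   # one value per slice
summary(slice_rmse)  # spread of out-of-sample error across slices
```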

As mentioned earlier, this sort of time slicing is done in one step by the code in your original question:

    > myTimeControl <- trainControl(method = "timeslice",
    +                               initialWindow = 36,
    +                               horizon = 12,
    +                               fixedWindow = TRUE)
    >
    > plsFitTime <- train(unemploy ~ pce + pop + psavert,
    +                     data = economics,
    +                     method = "pls",
    +                     preProc = c("center", "scale"),
    +                     trControl = myTimeControl)
    > plsFitTime
    Partial Least Squares

    478 samples
      5 predictors

    Pre-processing: centered, scaled
    Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)

    Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...

    Resampling results across tuning parameters:

      ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
      1      1080  0.443     796      0.297
      2      1090  0.43      845      0.295

    RMSE was used to select the optimal model using the smallest value.
    The final value used for the model was ncomp = 1.
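
The resampled RMSE above is also what selects ncomp, so it is not an independent test score. If you want the three steps from the question (split, tune, evaluate), one option is to hold back the most recent observations before tuning. This is a minimal sketch under assumptions: the 24-month test period is an arbitrary choice, and `train_set`, `test_set` and `test_metrics` are names introduced only for illustration.

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)

data(economics)
economics <- as.data.frame(economics)

# Hold out the last 24 months as a final test set (24 is arbitrary).
n_test    <- 24
n         <- nrow(economics)
train_set <- economics[seq_len(n - n_test), ]
test_set  <- economics[(n - n_test + 1):n, ]

# Tune ncomp with rolling-origin resampling on the training period only.
myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 36,
                              horizon = 12,
                              fixedWindow = TRUE)

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = train_set,
                    method = "pls",
                    preProc = c("center", "scale"),
                    trControl = myTimeControl)

# Evaluate the tuned model once on the untouched test period.
pred         <- predict(plsFitTime, newdata = test_set)
test_metrics <- postResample(pred = pred, obs = test_set$unemploy)
test_metrics
```

Because the split is chronological, the test period always lies after everything the model saw during tuning.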

Hope this helps!!

answered Oct 08 '22 10:10

Shambho

Shambho's answer provides a decent example of how to use the caret package with time slices; however, it can be misleading in terms of modelling technique. So as not to misguide future readers who want to use the caret package for predictive modelling on time series (and here I do not mean autoregressive models), I want to highlight a few things.

The problem with time-series data is that look-ahead bias is easy to introduce if one is not careful. In this case, the economics data set has its data aligned at their economic reporting dates and not their release dates, which is never the case in real-life applications (economic data points have different time stamps). Unemployment data may be two months behind the other indicators in terms of release date, which would then introduce a model bias in Shambho's example.

Next, this example is only descriptive statistics and not predictive (forecasting), because the data we want to forecast (unemploy) is not lagged correctly. It merely trains a model to best explain the variation in unemployment (which in this case is also a non-stationary time series, creating all sorts of issues in the modelling process) based on predictor variables at the same economic report dates.
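
One way to turn the setup into a genuine forecasting problem is to pair predictors observed in month t with the response h months later. Below is a minimal base-R sketch, assuming a toy data frame and a hypothetical helper `lead_by`. The horizon `h = 2` is only an example and says nothing about the actual release lags of these series.

```r
# Hypothetical helper: shift a vector h steps into the future,
# padding the tail with NA where no future value exists.
lead_by <- function(x, h) c(x[-seq_len(h)], rep(NA, h))

toy <- data.frame(month     = 1:8,
                  predictor = c(5, 6, 7, 8, 9, 10, 11, 12),
                  unemploy  = c(50, 52, 51, 55, 57, 56, 60, 61))

h <- 2
toy$unemploy_ahead <- lead_by(toy$unemploy, h)

# Rows without a known future value cannot be used for training.
model_data <- toy[!is.na(toy$unemploy_ahead), ]
model_data

# The model now uses only information available in month t.
fit <- lm(unemploy_ahead ~ predictor, data = model_data)
```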

Lastly, the 12-month horizon in this example is not true multi-period forecasting as Hyndman does it in his examples.

Hyndman on cross-validation for time-series

answered Oct 08 '22 09:10

P. Garnry
