R: Two approaches on forecasting monthly sales data with Support Vector Machines

I have a question regarding time series and SVM. I've asked the mighty internet but unfortunately the information is scarce and is mostly concerned with trading data.

My situation is the following: I am currently trying to switch from ARIMA forecasts to more sophisticated models, and to understand and implement an SVM model. I found some data on monthly sales of Asian cars in the US market and am now experimenting with it.

First I approach time series forecasting with SVR / SVM using two different routines. Next I run a very simple auto.arima on the same data. Finally I compare the residuals of these 3 approaches.

My questions are: Am I heading in the right direction with these implementations? How can I improve the SVM models? Is there more information concerning forecasting other than financial data?

Let's start with a small workaround to construct my input matrix (data from goodcarbadcar.net):

library(zoo)
library(e1071)
library(quantmod)
library(kernlab)
library(caret)
library(forecast)

Date <-c( "2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-08-01", "2010-09-01",
    "2010-10-01", "2010-11-01", "2010-12-01", "2011-01-01", "2011-02-01", "2011-03-01", "2011-04-01", "2011-05-01", "2011-06-01",
    "2011-07-01", "2011-08-01", "2011-09-01", "2011-10-01", "2011-11-01", "2011-12-01", "2012-01-01", "2012-02-01", "2012-03-01",
    "2012-04-01", "2012-05-01", "2012-06-01", "2012-07-01", "2012-08-01", "2012-09-01", "2012-10-01", "2012-11-01", "2012-12-01",
    "2013-01-01", "2013-02-01", "2013-03-01", "2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01", "2013-09-01",
    "2013-10-01", "2013-11-01", "2013-12-01", "2014-01-01", "2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01",
    "2014-07-01", "2014-08-01", "2014-09-01", "2014-10-01", "2014-11-01", "2014-12-01", "2015-01-01", "2015-02-01", "2015-03-01",
    "2015-04-01", "2015-05-01", "2015-06-01", "2015-07-01", "2015-08-01", "2015-09-01", "2015-10-01", "2015-11-01", "2015-12-01",
    "2016-01-01", "2016-02-01", "2016-03-01")

Nissan <- c( 55861, 63148, 85526, 56558, 75673, 56266, 72573, 67399, 65900, 61843, 63184, 81228, 64442, 83226, 109854, 64765, 69759,
            65659, 77191, 82517, 84485, 75484, 76754, 89937, 72517, 97492, 126132, 64200, 81202, 81801, 86722, 87360, 82462, 70928,
            84300, 86663, 73793, 90489, 126623, 80003, 106558, 95010, 101279, 108614, 77828, 81866, 93376, 96526, 81472, 105631, 136642,
            94764, 125558, 101069, 112914, 125224, 95118, 94072, 91790, 105311, 94449, 106777, 132560, 99869, 124305, 114243, 120439,
            122716, 111562, 104904, 95389, 124207, 97220, 120540, 149784)

Mitsubishi <- c( 4170, 4019, 5434, 3932, 4737, 4198, 5648, 4293, 4961, 5111, 4306, 4874, 5714, 6893, 7560, 8081, 7568, 8299, 7972,
                7985, 5803, 4378, 3735, 5032, 4711, 4736, 7160, 5280, 5575, 5411, 4194, 4249, 4806, 3981, 3574, 4113, 4659, 6051,
                5286, 4461, 4715, 5297, 5230, 5281, 4001, 4752, 6071, 6423, 4867, 5977, 8996, 6542, 7269, 6021, 6349, 6786, 5558,
                6199, 6534, 6545, 1112, 1184, 1715, 1933, 1996, 1982, 2052, 2320, 2066, 1984, 1637, 1403, 1288, 1547, 2123)

mydata <- data.frame(Date, Nissan, Mitsubishi)
mydata$Date <- as.Date(mydata$Date, format = "%Y-%m-%d")
mydata <- xts(mydata[,-1], order.by = mydata[,1])

Thus I have the same input as with the .csv import. Next step is to define a data.frame which will be the basis for further analysis.

Let us assume that sales of Nissan at time t depend on Nissan sales at times t-1, t-2 and t-3. Furthermore, assume that sales of Nissan at time t also depend on Mitsubishi sales at time t-3.
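To make the lag assumption concrete, here is a minimal base-R sketch of how such a design matrix lines up. The helper `make_lags()` and the toy vectors are hypothetical and only for illustration; the code further down builds the same structure with `lag()` and `merge()` on xts objects.

```r
# Minimal sketch: build a lagged design matrix with base R.
# make_lags() is a hypothetical helper, not part of any package.
make_lags <- function(y, x, y_lags = 1:3, x_lag = 3) {
  stopifnot(length(y) == length(x))
  max_lag <- max(c(y_lags, x_lag))
  n <- length(y)
  idx <- (max_lag + 1):n                      # rows that have a full history
  out <- data.frame(TARGET = y[idx])
  for (k in y_lags) {
    out[[paste0("n.lag.", k)]] <- y[idx - k]  # own sales k periods earlier
  }
  out[[paste0("m.lag.", x_lag)]] <- x[idx - x_lag]  # other series, x_lag periods earlier
  out
}

# Toy series so the alignment is easy to check by eye
y <- c(10, 20, 30, 40, 50, 60)
x <- c(1, 2, 3, 4, 5, 6)
lagged <- make_lags(y, x)
print(lagged)
```

The first `max_lag` observations are dropped because they have no complete history, which is why 75 months of data leave 72 usable rows with three lags.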

Now to the first approach. Here I use time slices.

####################
#Use SVR  technique#
####################

Nissan <- mydata$Nissan
Mitsubishi <- mydata$Mitsubishi

#Assume dependency on Nissan Lag1 + Lag2 + Lag3 and Mitsubishi Lag3
feature = merge(lag(Nissan,1),lag(Nissan,2), lag(Nissan,3),
             lag(Mitsubishi,3),
             all=FALSE)

colnames(feature) = c("n.lag.1", "n.lag.2", "n.lag.3",
                      "m.lag.3")

#TARGET to predict: Nissan
dataset = na.trim(merge(feature,Nissan,all=FALSE))

#Label columns of dataset
colnames(dataset) = c("n.lag.1", "n.lag.2", "n.lag.3",
                      "m.lag.3",
                      "TARGET")

#################
#Use Time Slices#
#################

myTimeControl <- trainControl(method = "timeslice",
                              initialWindow = 48,
                              horizon = 6,
                              fixedWindow = TRUE)

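#Note: method = "pls" fits a partial least squares model, not an SVR;
#caret's "svmRadial" (backed by kernlab) would fit a support vector regression
TimeModel <- train(TARGET ~ .,
                     data = dataset,
                     method = "pls",
                     preProc = c("center", "scale"),
                     trControl = myTimeControl)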
TimeModel

####################################
#Predict with control data set 2016#
####################################

#Define the test set
control.feature <- merge(lag(mydata$Nissan["2010/2016"],1), lag(mydata$Nissan["2010/2016"],2), lag(mydata$Nissan["2010/2016"],3),
                         lag(mydata$Mitsubishi["2010/2016"],3),
                         all = FALSE)

colnames(control.feature) = c("n.lag.1", "n.lag.2", "n.lag.3",
                      "m.lag.3")

#Make a prediction
svr.fc <- predict(TimeModel, control.feature["2016"])

#Show SVR Residuals
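For reference, the time-slice resampling configured in trainControl above (initialWindow = 48, horizon = 6, fixedWindow = TRUE) can be pictured with a small base-R sketch. The helper `time_slices()` is hypothetical; caret offers `createTimeSlices()` for the same purpose. The row count of 72 assumes 75 months minus the 3 rows lost to lagging.

```r
# Minimal sketch of rolling-origin ("timeslice") resampling in base R.
# time_slices() is a hypothetical helper written for illustration only.
time_slices <- function(n, initial_window, horizon, fixed_window = TRUE) {
  stopifnot(initial_window + horizon <= n)
  ends <- initial_window:(n - horizon)        # last training row of each slice
  lapply(ends, function(end) {
    start <- if (fixed_window) end - initial_window + 1 else 1
    list(train = start:end,                   # rows used for fitting
         test  = (end + 1):(end + horizon))   # the next `horizon` rows
  })
}

# Assumed: 72 usable rows after lagging
slices <- time_slices(n = 72, initial_window = 48, horizon = 6)
n_slices <- length(slices)
first <- slices[[1]]
last  <- slices[[n_slices]]
cat("resamples:", n_slices, "\n")
cat("first train rows:", range(first$train), " test rows:", range(first$test), "\n")
cat("last  train rows:", range(last$train),  " test rows:", range(last$test), "\n")
```

With a fixed window, every resample trains on exactly 48 consecutive rows and is scored on the following 6, so the model is never evaluated on observations that precede its training data.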

Now I would like to show my second approach relying on the package e1071

####################
#Use  Package e1071#
####################

#initialize svm model
nissan.model <- svm(TARGET ~ ., dataset)

#test model on the existing set
nissanY <- predict(nissan.model, dataset)
plot(index(dataset),dataset[,ncol(dataset)], pch=16)
points(index(dataset),nissanY, col="red", pch=4)

#predict 2016 values and compare with actuals
predictY <- predict(nissan.model, control.feature["2016"])
mydata$Nissan["2016"] - predictY

#tune the existing model with grid search
nissan.tuneResult <- tune(svm, TARGET ~ ., data = dataset,
                     ranges = list(epsilon = seq(0,1, 0.01), cost=2^(2:9)))
print(nissan.tuneResult)
plot(nissan.tuneResult)

#initialize tuned model
tuned.nissan.model <- nissan.tuneResult$best.model
tuned.nissanY <- predict(tuned.nissan.model, dataset)

plot(index(dataset),dataset[,ncol(dataset)], pch=16)
points(index(dataset),tuned.nissanY, col="red", pch=4)

#compare 2016 forecast and actual values
tuned.predictY <- predict(tuned.nissan.model, control.feature["2016"])

Last but not least I present my vanilla auto.arima.

############################
#Use time series techniques#
############################

#Define time series which should be forecasted
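#Fit on 2010-2015 and hold out 2016 so the forecast can be compared with actuals
history <- mydata["2010/2015"]
actual  <- mydata["2016"]
nissan.ts <- ts(as.numeric(history$Nissan), frequency = 12, start = c(2010,1))

#Assume that Nissan depends on Mitsubishi (xreg must be a numeric matrix)
xreg     <- matrix(as.numeric(history$Mitsubishi), ncol = 1, dimnames = list(NULL, "Mitsubishi"))
xreg.new <- matrix(as.numeric(actual$Mitsubishi), ncol = 1, dimnames = list(NULL, "Mitsubishi"))

#auto.arima on Nissan with XREG = Mitsubishi
arima.fit <- auto.arima(nissan.ts, D = 1, xreg = xreg)
arima.fc <- forecast(arima.fit, xreg = xreg.new)

#How did the ARIMA forecast perform in the first quarter of 2016
comparison.arima <- as.numeric(actual$Nissan) - as.numeric(arima.fc$mean)
rmse.arima <- sqrt(mean(comparison.arima^2))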

And finally, the comparison of the residuals from these 3 approaches:

#Print the details ARIMA
comparison.arima
#Print details e1071
mydata$Nissan["2016"] - tuned.predictY
#Print details Time Slices
mydata$Nissan["2016"] - svr.fc
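To put the three sets of residuals on one scale, a single error measure such as RMSE can be computed for each forecast. The sketch below is only an illustration: `rmse()` is a hypothetical helper, and the numbers are made up to show the call, not results from the models above.

```r
# Minimal sketch: compare several forecasts with one error measure.
# rmse() is a hypothetical helper; the toy numbers are NOT model output.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
}

actual_toy <- c(100, 110, 120)
forecasts_toy <- list(model_a = c(98, 113, 120),
                      model_b = c(90, 100, 130))

# One RMSE per forecast, all measured against the same actual values
scores <- sapply(forecasts_toy, rmse, actual = actual_toy)
print(scores)
```

In the setting above, the same function could be applied to the 2016 actuals and each of svr.fc, tuned.predictY and the ARIMA forecast, assuming all three cover the same months.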

My own conclusions are:

  1. auto.arima selects a (2,1,0) model, whereas the SVM models use 3 lags. This accounts for part of the difference.
  2. auto.arima does not use lag(Mitsubishi,3), only the unlagged Mitsubishi series passed as xreg.
  3. Both SVM models rest on assumed lags without any further specification.
  4. Possibly there is simply not enough data for an SVM.

I would be glad if you could give me some hints on my questions or, even better, discuss the models. Best regards

Alex

Alex asked Nov 09 '22 16:11


1 Answer

Alex, why are you assuming that Mitsubishi has an impact on Y? Bad things happen when you force uncorrelated data together. I plotted Y against X in a normalized scatterplot with X shifted down 3 periods (i.e. losing the first 3 observations). The two don't look correlated, which might be part of your problem.

[Image: normalized scatterplot of Y (Nissan) against X (Mitsubishi) shifted 3 periods]

The Mitsubishi data had a major decrease at period 61 and Nissan was unaffected. This supports the view that there is no relationship. There are many really good causal examples you might want to try first (Lydia Pinkham, or Gas Furnace with Air & Methane from Box-Jenkins).
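The check described above, shifting X by 3 periods and comparing it with Y, can be sketched in base R as follows. The helper `lagged_cor()` and the simulated series are hypothetical stand-ins, not the Nissan and Mitsubishi data; `stats::ccf()` gives the same kind of information across many lags at once.

```r
# Minimal sketch: correlation between y at time t and x at time t - lag.
# lagged_cor() is a hypothetical helper; the series below are simulated.
lagged_cor <- function(y, x, lag = 3) {
  n <- length(y)
  stopifnot(length(x) == n, lag >= 0, lag < n - 1)
  y_now  <- y[(lag + 1):n]      # drop the first `lag` observations of y
  x_past <- x[1:(n - lag)]      # x shifted down by `lag` periods
  cor(y_now, x_past)            # correlation is unaffected by normalizing
}

set.seed(1)
x <- rnorm(60)
y_dep <- 5 + 2 * c(rnorm(3), x[1:57])   # driven by x three periods earlier
y_ind <- rnorm(60)                      # unrelated to x

r_dep <- lagged_cor(y_dep, x, lag = 3)
r_ind <- lagged_cor(y_ind, x, lag = 3)
cat("dependent series:  ", round(r_dep, 3), "\n")
cat("independent series:", round(r_ind, 3), "\n")
```

A lagged correlation near zero, as for the independent series here, is the numerical counterpart of a scatterplot with no visible pattern.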

Tom Reilly answered Nov 15 '22 08:11