I'm trying to do a time series analysis using randomForest. Please find my code below:
Subsales<-read.csv('Sales.csv')
head(Subsales)
Sample data:
  Date       SKU                                 City    Sales
  <date>     <chr>                               <chr>   <dbl>
1 2014-08-11 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   378
2 2014-08-18 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   348
3 2014-08-25 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   314
4 2014-09-01 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   324
5 2014-09-08 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   352
6 2014-09-15 Vaseline Petroleum Jelly Pure 60 ml Jeddah1   453
#### Split the data 80/20 into training and testing sets ####
library(dplyr)  # provides slice()
train_len <- round(nrow(Subsales) * 0.8)
test_len <- nrow(Subsales)
#### Training set
training <- slice(Subsales, 1:train_len)
#### Testing set: the parentheses matter because `:` binds tighter than `+`
testing <- slice(Subsales, (train_len + 1):test_len)
# Keep only the Date and Sales columns
training <- training[c(1, 4)]
testing <- testing[c(1, 4)]
library(randomForest)
set.seed(1234)
regressor = randomForest(formula=Sales~.,
data=training,
ntree=100)
y_pred = predict(regressor,newdata = testing)
I'm getting a flat (constant) result when I use the predict function on the test data set: all predicted values are 369. I tried another data set and got the same result. Can anyone tell me what I am doing wrong here?
Let me try to rephrase your question to make sure I accurately understand what you want to do.
You have weekly sales for a product, and you would like to predict future sales as a function of the date. You do NOT have any other predictive variables, such as number of customers or amount spent on advertising. Your input data looks like this:
Date Sales
2014-08-11 378
2014-08-18 348
2014-08-25 314
2014-09-01 324
2014-09-08 352
2014-09-15 453
...
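I think your random forest is behaving as expected. Random forest is a supervised machine learning algorithm that tries to predict y (the response, here Sales) given input variables x (the predictors). Here, the only x you supply is the date. However, every date in your test set is later than any date the forest saw during training, and tree-based models cannot extrapolate beyond the range of their training data. Each tree therefore sends all test dates to the same terminal node, so the forest returns the same value for every one of them: roughly the average of the most recent training sales.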
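To make the point above concrete, here is a minimal sketch. The data is synthetic (an upward trend plus noise that I made up for illustration, not your sales), and the date is passed as a plain number so the example does not depend on how randomForest handles Date columns:

```r
library(randomForest)

set.seed(1234)
# Synthetic weekly series with an upward trend (illustrative, not the asker's data)
dates <- seq(as.Date("2014-08-11"), by = "week", length.out = 100)
sales <- 300 + 2 * seq_along(dates) + rnorm(100, sd = 10)
df <- data.frame(t = as.numeric(dates), Sales = sales)

# Chronological 80/20 split, as in the question
train <- df[1:80, ]
test  <- df[81:100, ]

fit  <- randomForest(Sales ~ t, data = train, ntree = 100)
pred <- predict(fit, newdata = test)

# Every test date lies beyond the last training date, so each tree routes
# all of them to the same leaf and the forest repeats a single value.
length(unique(pred))
```

Even though the underlying series keeps rising, the forest has no way to continue the trend past the last date it was trained on.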
You have two options:
Option 1) Stick with your approach of using only the date as a predictor. You will need a different method, such as an autoregressive model like ARIMA. This approach tries to detect trends and seasonality in the data. Are sales more or less static, growing, or going down? Is there a weekly pattern, a monthly pattern, an annual pattern? An example to get you started can be found here.
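As a rough sketch of Option 1, base R's arima() can fit such a model and predict() can forecast from it. Everything specific here is an assumption on my part: the data is synthetic, the AR(1) order is arbitrary, and I model the trend by passing a simple time index through xreg. For real data you would choose the order by inspecting the series or by using a model selection tool.

```r
set.seed(1234)
# Synthetic weekly sales with an upward trend (illustrative only)
sales <- 300 + 2 * (1:100) + rnorm(100, sd = 10)
train <- sales[1:80]

# Linear trend via a time-index regressor, with AR(1) errors.
# The order c(1, 0, 0) is an arbitrary choice for this sketch.
fit <- arima(train, order = c(1, 0, 0), xreg = 1:80)

# Forecast the next 20 weeks; newxreg continues the time index
fc <- predict(fit, n.ahead = 20, newxreg = 81:100)
fc$pred  # point forecasts
fc$se    # standard errors of the forecasts
```

Unlike the random forest, this model can carry a trend forward into dates it has never seen.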
Option 2) Use data collection and feature engineering to create features that help your random forest predict values for new dates. For example, try to get data on how many customers came to the store in each period, or extract calendar features from the date, such as the day of the week (Monday, Tuesday, ...), and keep them as separate variables. If your data is weekly, as in your sample, the weekday is the same for every row, so the month or the week of the year will be more informative. The R package lubridate will help you do this. A brief example below:
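library(dplyr)      # provides mutate()
library(lubridate)  # provides wday()
# as.Date() covers the case where read.csv() imported Date as text
Subsales <- mutate(Subsales, Weekday = wday(as.Date(Date), label = TRUE))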
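To show how Option 2 changes the picture, here is a hedged sketch on synthetic weekly data with a yearly cycle (made up for illustration). The column names Month and Week are my own choices. The raw date is left out of the model, so the test rows share feature values with rows the forest has already seen:

```r
library(lubridate)
library(randomForest)

set.seed(1234)
# Synthetic weekly sales with a yearly cycle (illustrative, not the asker's data)
dates <- seq(as.Date("2014-08-11"), by = "week", length.out = 156)
sales <- 350 + 50 * sin(2 * pi * yday(dates) / 365) + rnorm(156, sd = 10)

# Calendar features derived from the date
df <- data.frame(
  Month = month(dates),  # 1 to 12
  Week  = week(dates),   # week of the year
  Sales = sales
)

# Chronological split: roughly the first 80% for training
train <- df[1:125, ]
test  <- df[126:156, ]

fit  <- randomForest(Sales ~ Month + Week, data = train, ntree = 100)
pred <- predict(fit, newdata = test)

# Predictions now vary with the season instead of collapsing to one value
range(pred)
```

Note that calendar features like these capture seasonality only. They do not help with a long-term upward or downward trend, which is where Option 1 or additional predictors come in.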
Hope this helps!