I'm using a big (isplit) loop on a huge set of Time Series for testing on ARIMA models. For this I'm using the auto.arima
function from the forecast
package.
For this I created a function, to traverse all the time series while keeping track of progress and store fitted models and stats (such as accuracy and model parameters). Now I'm dealing with an error generated by the auto.arima
function. To be more precise; caused by the OCSB seasonal testing.
I'm using this function for 'monthly' time series as well as for 'weekly' time series. For monthly time series a have no problems (almost 50000, including a lot with 'zero' values). With the weekly time series I ran into the problem. But I'm not able to find the real cause of the Error.
I tried to recreate the error. I though about it had something to do with a lot of 0 (or the same) values in combination with the 52 frequency period. But I still can't point the finger at the problem.
See the examples below. Some info: the set of time series are weekly values (freq=52), starting in 2010, week 1. The length is 122 samples (until 2012, week 18). Therefore I tested with lengths of 122, for which I can generate the error. I still think it has something to do with the frequency and 'running same values'...
For some the error is generated, for some not.
Example 1 [random numbers, length=122] > no problem:
ts_element <- ts(sample(0:30, 122, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Example 2 [only 0 values, length=122] > the OCSB test Error (normally I would assume a different error...see example 3):
ts_element <- ts(sample(0:0, 122, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Error in OCSBtest(x, m) : subscript out of bounds
Example 3 [only 0 values, length=100] > the 'zero/equal values' error, I assumed this one, this example is not the problem, but to point out that the length is relevant (compare with example 2):
ts_element <- ts(sample(0:0, 100, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Error in if (PVAL == min(tablep)) warning("p-value smaller than printed p-value") else warning("p-value greater than printed p-value") :
missing value where TRUE/FALSE needed
Example 4 [almost the same as ex.3, but with one non-0 value, length=100] > no problem anymore:
ts_element[30] <- 1
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Example 5 [almost the same as ex.4, but length=122] > the OCSB test Error:
ts_element <- ts(sample(0:0, 122, replace=TRUE), frequency = 52, start = c(2010, 1))
ts_element[30] <- 1
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Error in OCSBtest(x, m) : subscript out of bounds
Example 6 [random 1's and 0's, length=122] > no problem:
ts_element <- ts(sample(0:1, 122, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Example 7 [random numbers, smaller length of 50] > no problem:
ts_element <- ts(sample(1:34, 50, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
Does anybody have an idea what the cause of this the OCSB out of bounds Error is? How to recognize?
The main problem is that whenever this error occurs in the function I described in the beginning of this post, the function doesn't output all the information I'm gathering. So the hours of waiting is for nothing. So if the root cause cannot be found, I'm also very much helped with some code to deal with Errors in a way to 'ignore' them (skip that time series) and go further. Or ignore, but still output the information gathered at that moment.
How has the solution?
Note: the zero-error is not a problem. I'm covering this in my function.
Nice question, and well explained. You clearly thought this through before you submitted.
The problems in your examples are due to a number of issues (and in my opinion, bugs) around dealing with time series full of zeroes.
In general, you should use the debug
command to step through your code. For example, try debugging the five main functions that run for the auto.arima
:
debug(auto.arima)
debug(nsdiffs)
debug(forecast:::OCSBtest)
debug(lm)
debug(lm.fit)
(use Q
to exit and undebug
to stop debugging a function) and then try running your code in Example 2
ts_element <- ts(sample(0:0, 122, replace=TRUE), frequency = 52, start = c(2010, 1))
fit <- auto.arima(ts_element, trace=FALSE, seasonal.test="ocsb", allowdrift=TRUE, stepwise=TRUE)
After a lot pressing Enter
you will finally reach the point where R fails. In this case, it is a rather deep and nasty bug in lm.fit
. If all the coefficients are zero, then for some reason it converts them to NA
. When the OCSBtest
function tries to pull out the coefficients, it find the matrix empty, and tells you that it is not an appropriate index.
I would tell you to report this to R-bugs... but they can be pretty snippy when it comes to bugs in base
. They will probably tell you it is "user error" and that you shouldn't be fitting regression models to all zeroes (sigh).
The first problem with Example 3 appears to be a feature undocumented in the nsdiffs
page, which describes the forecast::OCSBtest
function. It looks like your time series must be bigger than 2 times the period + 5, or the seasonal differencing will not be run. In Example 2 this was true, but not in Example 3. Indeed, the very first bit of the code in the function is:
if (length(time.series) < (2 * period + 5)) {
return(0)
}
Read up in the two Osborn references listed in the nsdiffs
page, maybe it mentions it there somewhere. It would be a good idea to let the authors of forecast
know so they could include it in the documentation somewhere. Maybe even throw a warning, with an option to turn it off.
Example 3 has a different error than Example 2 is because Example 3 immediately exits the nsdiffs
function, and then goes on to fail in the ndiff
function, which does the differencing. ndiff
appears to have a bug in which if the sum of square differences is zero (because the series is zero), then it causes a divide by zero error. Here is the relevant code in the ndiff
function:
s2 <- .C("R_pp_sum", as.vector(e, mode = "double"), as.integer(n), as.integer(l), s2 = as.double(s2), PACKAGE = "tseries")$s2
STAT <- eta/s2 # Becomes NaN
PVAL <- approx(table, tablep, STAT, rule = 2)$y # Also NaN
if (is.na(approx(table, tablep, STAT, rule = 1)$y)) if (PVAL ==
min(tablep)) warning("p-value smaller than printed p-value") else warning("p-value greater than printed p-value") # Bombs
Example 4 succeeds because s2
is never zero. A simple fix would be to check if s2
was zero before dividing.
Example 5 fails for much the same reason as Example 2. It enters the nsdiff
function because it has a length of more than 2*period+5
, and then fails because lm.fit
doesn't return the coefficients when they are all zero.
Example 6 succeeds because lm.fit
will now properly return the coefficients, because they are not all zero, because your time series is mixed with ones and zeroes.
Example 7 succeeds because nsdiff
is not run (because the series is too small) and ndiff
will no longer cause a divide by zero, because the sum of squared differences will not be zero.
In conclusion, your examples have shown two bugs. One in ndiff
when the time series is always zero, and another in the lm.fit
function when the covariates are all zero. Also, the documentation in nsdiff
should be updated to tell you that it won't be run if the time series has a length of less than 2*period+5
if you use the 'ocsb' option (but maybe this is documented in the references).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With