When I use the acf function in R, it plots horizontal lines representing the confidence interval (95% by default) for the autocorrelations at various lags:
However, when I use statsmodels.graphics.tsaplots.plot_acf
in python I see a curved confidence interval based on a more sophisticated computation:
Notice that in the R version, the lags up through lag 25 are considered significant, while for the same data the python version considers only the lags up through 20 significant.
What is the difference between these two methods, and which one should I trust more? Can someone explain the theory behind the non-constant confidence interval computed by statsmodels.tsa.stattools.acf?
I know I can reproduce the R horizontal lines by simply plotting y = ±1.96 / np.sqrt(len(data)). However, I'd like to understand the fancy curved confidence interval.
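For reference, a minimal sketch of the constant R-style band (the function name `constant_ci` is my own; it just evaluates z_{1-alpha/2} / sqrt(n)):

```python
import numpy as np
from scipy import stats

def constant_ci(n, alpha=0.05):
    """Half-width of the constant (R-style) confidence band: z_{1-alpha/2} / sqrt(n)."""
    return stats.norm.ppf(1 - alpha / 2) / np.sqrt(n)

# For a series of length 100, the 95% band is roughly +/-0.196 at every lag.
half_width = constant_ci(100)
```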
It has been shown that the sample autocorrelation coefficient r(k) approximately follows a Gaussian distribution; the two libraries differ in how they estimate its variance Var(r(k)).
As you have found, in R the variance is simply calculated as Var(r(k)) = 1/N for all k. In python, the variance is calculated using Bartlett's formula, Var(r(k)) = (1/N) * (1 + 2*(r(1)^2 + r(2)^2 + ... + r(k-1)^2)). This produces the confidence band shown above, which first widens with lag and then flattens.
Source code of ACF variances in python:
varacf = np.ones(nlags + 1) / nobs
varacf[0] = 0
varacf[1] = 1. / nobs
varacf[2:] *= 1 + 2 * np.cumsum(acf[1:-1]**2)
These two distinct formulas are based on different assumptions. The former assumes an i.i.d. process with r(k) = 0 for all k != 0, while the latter assumes an MA(k-1) process, whose ACF "cuts off" after lag k-1.
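To see the difference concretely, here is a sketch that applies Bartlett's formula (mirroring the statsmodels computation quoted above) to a toy set of ACF values; the helper name `bartlett_ci` and the sample numbers are my own:

```python
import numpy as np

def bartlett_ci(acf_vals, n, z=1.96):
    """Bartlett half-widths: Var(r_k) = (1 + 2*sum_{j<k} r_j^2) / n for k >= 2."""
    var = np.ones(len(acf_vals)) / n
    var[0] = 0.0                                   # r(0) = 1 exactly, no uncertainty
    var[2:] *= 1 + 2 * np.cumsum(acf_vals[1:-1] ** 2)
    return z * np.sqrt(var)

# Toy ACF: r(0) = 1 plus geometrically decaying correlations at lags 1..4.
acf_vals = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])
n = 100
bart = bartlett_ci(acf_vals, n)
const = 1.96 / np.sqrt(n)   # R-style band, identical at every lag

# At lag 1 the two bands agree; from lag 2 onward the Bartlett band is wider,
# since it accounts for the nonzero correlations at the earlier lags.
```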
Not really an answer to the theory part of this (which might be better on CrossValidated), but maybe useful ... ?
If you go to the documentation page for statsmodels.tsa.stattools.acf it gives you an option to browse the source code. The code there is:
varacf = np.ones(nlags + 1) / nobs
varacf[0] = 0
varacf[1] = 1. / nobs
varacf[2:] *= 1 + 2 * np.cumsum(acf[1:-1]**2)
interval = stats.norm.ppf(1 - alpha / 2.) * np.sqrt(varacf)
confint = np.array(lzip(acf - interval, acf + interval))
In contrast, the R source code for plot.acf shows
clim0 <- if (with.ci) qnorm((1 + ci)/2)/sqrt(x$n.used) else c(0, 0)
where ci is the confidence level (default = 0.95).