I'm new to Granger Causality and would appreciate any advice on understanding/interpreting the results of the python statsmodels output. I've constructed two data sets (sine functions shifted in time with noise added)
and put them in a "data" matrix with signal 1 as the first column and signal 2 as the second. I then ran the tests using:
granger_test_result = sm.tsa.stattools.grangercausalitytests(data, maxlag=40, verbose=True)
The results showed that the optimal lag (in terms of the highest F-test value) was a lag of 1.
Granger Causality
('number of lags (no zero)', 1)
ssr based F test: F=96.6366 , p=0.0000 , df_denom=995, df_num=1
ssr based chi2 test: chi2=96.9280 , p=0.0000 , df=1
likelihood ratio test: chi2=92.5052 , p=0.0000 , df=1
parameter F test: F=96.6366 , p=0.0000 , df_denom=995, df_num=1
However, the lag that seems to best describe the optimal overlap of the data is around 25 (in the figure below, signal 1 has been shifted to the right by 25 points):
Granger Causality
('number of lags (no zero)', 25)
ssr based F test: F=4.1891 , p=0.0000 , df_denom=923, df_num=25
ssr based chi2 test: chi2=110.5149, p=0.0000 , df=25
likelihood ratio test: chi2=104.6823, p=0.0000 , df=25
parameter F test: F=4.1891 , p=0.0000 , df_denom=923, df_num=25
I'm clearly misinterpreting something here. Why wouldn't the predicted lag match up with the shift in the data?
Also, can anyone explain to me why the p-values are so small as to be negligible for most lag values? They only begin to show up as non-zero for lags greater than 30.
Thanks for any help you can give.
As stated here, in order to run a Granger causality test, the time series you are using must be stationary. A common way to achieve this is to transform both series by taking the first difference of each:
x = np.diff(x)  # np.diff already shortens each series by one sample
y = np.diff(y)
Here is the comparison of Granger Causality results at lag 1 and lag 25 for the similar dataset I generated:
Unchanged
Granger Causality
number of lags (no zero) 1
ssr based F test: F=19.8998 , p=0.0000 , df_denom=221, df_num=1
ssr based chi2 test: chi2=20.1700 , p=0.0000 , df=1
likelihood ratio test: chi2=19.3129 , p=0.0000 , df=1
parameter F test: F=19.8998 , p=0.0000 , df_denom=221, df_num=1
Granger Causality
number of lags (no zero) 25
ssr based F test: F=6.9970 , p=0.0000 , df_denom=149, df_num=25
ssr based chi2 test: chi2=234.7975, p=0.0000 , df=25
likelihood ratio test: chi2=155.3126, p=0.0000 , df=25
parameter F test: F=6.9970 , p=0.0000 , df_denom=149, df_num=25
1st Difference
Granger Causality
number of lags (no zero) 1
ssr based F test: F=0.1279 , p=0.7210 , df_denom=219, df_num=1
ssr based chi2 test: chi2=0.1297 , p=0.7188 , df=1
likelihood ratio test: chi2=0.1296 , p=0.7188 , df=1
parameter F test: F=0.1279 , p=0.7210 , df_denom=219, df_num=1
Granger Causality
number of lags (no zero) 25
ssr based F test: F=6.2471 , p=0.0000 , df_denom=147, df_num=25
ssr based chi2 test: chi2=210.3621, p=0.0000 , df=25
likelihood ratio test: chi2=143.3297, p=0.0000 , df=25
parameter F test: F=6.2471 , p=0.0000 , df_denom=147, df_num=25
I'll try to explain what is happening conceptually. Because the series you are using have a clear trend in mean, the early lags (1, 2, ...) all give significant predictive models in the F test: the long-term trend makes it very easy to (negatively) correlate the x values one lag away with the y values. Additionally (this one is more of an educated guess), I think the F statistic at lag 25 is so low compared to the early lags because much of the variance explained by the x series is already contained in the autocorrelation of y at lags 1-25, since the non-stationarity gives the autocorrelation more predictive power.
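One more side note on the original lag question: the Granger F statistic is not an alignment estimator, so its "best" lag need not match the 25-sample shift. If the goal is to recover the shift itself, plain cross-correlation is the more direct tool. A sketch on simulated data of my own (period and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, shift = 1000, 25
t = np.arange(n + shift)
base = np.sin(2 * np.pi * t / 50)
leader = base[shift:] + 0.1 * rng.standard_normal(n)  # leads by 25 samples
lagger = base[:n] + 0.1 * rng.standard_normal(n)

# slide the lagging series back over the leader and pick the best-fitting lag
lags = np.arange(60)
corrs = [np.corrcoef(leader[: n - k], lagger[k:])[0, 1] for k in lags]
best_lag = int(lags[np.argmax(corrs)])
print(best_lag)  # expect 25, the true shift
```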