I'm calculating the autocorrelation function for a stock's returns. To do so I tested two functions: the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import datetime

import pandas as pd
import matplotlib.pyplot as plt
from dateutil.relativedelta import relativedelta
from pandas_datareader import data
from statsmodels.tsa.stattools import acf

ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months=6)

# Daily returns over the last six months
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)

# Lags 1..31 with both implementations (request 31 lags explicitly so the slices match)
ticker_data_acf_1 = acf(ticker_data, nlags=31)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1, 32)]

test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Statsmodels Autocorr', 'Pandas Autocorr']  # match the order of the columns above
test_df.index += 1
test_df.plot(kind='bar')
plt.show()
What I noticed was that the values they produced weren't identical.
What accounts for this difference, and which values should be used?
An ACF plot is a bar chart of the correlation coefficients between a time series and its lagged values. Simply stated: the ACF explains how the present value of a given time series is correlated with its past (1-unit past, 2-unit past, ..., n-unit past) values.
Autocorrelation measures the relationship between a variable's current value and its past values. An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation of -1 represents a perfect negative correlation.
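For reference, the usual textbook sample autocorrelation at lag k for a series x_1, ..., x_n (my own summary of the standard definition, not a quote from either library's documentation) is

\hat{\rho}_k = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}

i.e. the lag-k sample autocovariance divided by the sample variance of the whole series.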
The difference between the Pandas and Statsmodels versions lies in the mean subtraction and the normalization / variance division:
autocorr does nothing more than pass subseries of the original series to np.corrcoef. Inside that method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient.

acf, in contrast, uses the overall series' sample mean and sample variance to determine the correlation coefficient.

The differences may get smaller for longer time series but are quite big for short ones.
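Written out (this is my reading of the two implementations, matching the by-hand functions in the MWE below, not a quote from either library's docs), with y1 = x[:n-k] and y2 = x[k:] the two lagged subseries:

\text{autocorr}(k) = \frac{\sum_t (y_{1,t} - \bar{y}_1)(y_{2,t} - \bar{y}_2)}{(n-k)\,\hat{\sigma}_{y_1}\hat{\sigma}_{y_2}}
\qquad
\text{acf}(k) = \frac{\sum_t (x_t - \bar{x})(x_{t+k} - \bar{x})}{(n-k)\,\hat{\sigma}_x^2}

The acf formula shown here uses the unbiased (n - k) denominator, matching the unbiased=True call in the MWE below; the default normalization divides by n instead.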
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlab's xcorr (cross-correlation) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf

plt.style.use("seaborn-colorblind")  # newer matplotlib names this style "seaborn-v0_8-colorblind"

def autocorr_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the subseries means
    sum_product = np.sum((y1 - np.mean(y1)) * (y2 - np.mean(y2)))
    # Normalize with the subseries stds
    return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))

def acf_by_hand(x, lag):
    # Slice the relevant subseries based on the lag
    y1 = x[:(len(x) - lag)]
    y2 = x[lag:]
    # Subtract the mean of the whole series x to calculate Cov
    sum_product = np.sum((y1 - np.mean(x)) * (y2 - np.mean(x)))
    # Normalize with var of whole series
    return sum_product / ((len(x) - lag) * np.var(x))

x = np.linspace(0, 100, 101)
results = {}
nlags = 10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags - 1)  # newer statsmodels calls this argument "adjusted"

pd.DataFrame(results).plot(kind="bar", figsize=(10, 5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.
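To illustrate that point, here is a minimal sketch (my own illustration of the idea, not statsmodels' actual source) showing how np.correlate produces the same unnormalized autocovariances, reusing the x array from the MWE above:

import numpy as np
from statsmodels.tsa.stattools import acf

x = np.linspace(0, 100, 101)
xm = x - x.mean()

# 'full' mode returns the sum of lagged products for every possible shift;
# the second half of the output corresponds to lags 0, 1, ..., n-1.
acov = np.correlate(xm, xm, mode="full")[len(x) - 1:]

# Dividing by the lag-0 term gives the ACF with the default (non-adjusted)
# denominator n, which cancels in the ratio. Note this matches acf's default
# normalization, not the unbiased/adjusted (n - lag) version used above.
acf_via_correlate = acov / acov[0]

print(acf_via_correlate[:10])
print(acf(x, nlags=9))  # should agree up to floating-point error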