As the title suggests, where has the rolling window option of the ols command in pandas migrated to in statsmodels? I can't seem to find it. pandas tells me doom is in the works:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
In fact, if you do something like:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
you get results (the window argument does not impair the running of the code), but you get only the parameters of the regression run over the entire period, not the series of parameters for each rolling window it is supposed to work on.
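For what it's worth, newer versions of statsmodels (0.11+, if available in your environment) ship a dedicated rolling estimator, statsmodels.regression.rolling.RollingOLS. A minimal sketch, assuming series_1 and mmmm are aligned pandas objects as above:

import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

# add an intercept column; RollingOLS does not add one automatically
exog = sm.add_constant(mmmm)
rolling_res = RollingOLS(series_1, exog, window=50).fit()
print(rolling_res.params.head())  # one row of coefficients per 50-observation window

Here rolling_res.params is a DataFrame indexed like series_1, with NaN rows until the first full window.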
I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.
It has three core classes:
- OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
- RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimension NumPy arrays.
- PandasRollingOLS: wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes advantage of broadcasting extensively. Attributes largely mimic those of statsmodels' OLS RegressionResultsWrapper.
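Assuming the PyPI upload has gone through by the time you read this, installation should be something like the line below (the package name is taken from the import in the example):

pip install pyfinance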
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
Rolling beta with sklearn
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
    assert len(X) == len(y)
    out_dates = []
    out_beta = []
    model_ols = linear_model.LinearRegression()
    for iStart in range(0, len(X) - window):
        iEnd = iStart + window
        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
        # store output
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])
    return pd.DataFrame({'beta': out_beta}, index=out_dates)
df_beta = rolling_beta(
    df_rtn_stocks['NDX'].values.reshape(-1, 1),
    df_rtn_stocks['CRM'].values.reshape(-1, 1),
    df_rtn_stocks.index.values,
    255,
)
Adding, for completeness, a speedier numpy-only solution that limits calculations to the regression coefficients and the final estimate.
Numpy rolling regression function
import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim according to shortest ==========================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    else:
        # === Add a constant if needed ========================================
        X = x.to_frame()
        X['c'] = 1
        # === Loop... this can be improved ====================================
        estimate_data = []
        for i in range(window, x.index.size + 1):
            X_slice = X.values[i - window:i, :]  # always index in np as opposed to pandas, much faster
            y_slice = y.values[i - window:i]
            coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
            # evaluate at the last point of the current window (x.values[i - 1])
            estimate_data.append(coeff[0] * x.values[i - 1] + coeff[1])
        # === Assemble ========================================================
        estimate = pd.Series(data=estimate_data, index=x.index[window - 1:])
        return estimate
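A minimal usage sketch, with hypothetical synthetic series (the names and parameters here are illustrative, not from the original):

import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", periods=200, freq="B")
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200), index=dates)
y = 0.5 * x + pd.Series(rng.normal(scale=0.1, size=200), index=dates)

fitted = rolling_regression(y, x, window=60)
print(fitted.tail())  # one fitted value per window, indexed by each window's end date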
Notes
- In some specific use cases, which only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.
- As a reminder, the coefficients for a regression can be calculated as a matrix product, as you can read on Wikipedia's least squares page. This approach via numpy's matrix multiplication can speed up the process somewhat versus using the OLS in statsmodels. This product is expressed in the line starting with coeff = ...
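For reference, the matrix product in question is the normal-equations solution beta = (X'X)^-1 X'y. A quick standalone sketch (synthetic data, illustrative only) checking it against numpy's built-in least-squares solver:

import numpy as np

# one regressor plus a constant column, mirroring the X built in rolling_regression
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=100), np.ones(100)])
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=100)

# normal equations: beta = (X'X)^-1 X'y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# cross-check with numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # approximately [2.0, 1.0]
print(beta_lstsq)   # should match to numerical precision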