I am using Pandas 0.8.1, and at the moment I can't change the version. If a newer version would help with the problem below, please note it in a comment rather than an answer. Also, this is for a research replication project, so even though re-running a regression after appending only one new data point might be silly (if the data set is large), I still have to do it. Thanks!
In Pandas, there is a rolling option for the window_type argument to pandas.ols, but it seems implicit that this requires choosing some window size or defaulting to the whole data sample. I'm looking instead to use all the data in a cumulative fashion.
I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using the data available from the minimum date up through the date at index i. So the window effectively grows by one on every iteration: all data from the earliest observation onward is used cumulatively, and no data is ever dropped out of the window.
I have written a function (below) that works with apply to perform this, but it is unacceptably slow. Is there instead a way to use pandas.ols to perform this sort of cumulative regression directly?
Here are some more specifics about my data. I have a pandas.DataFrame containing a column of identifiers, a column of dates, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group on the identifier, and then, for every time period, perform a cumulative regression of the left-hand-side variable on the right-hand-side variable.
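For concreteness, a toy frame with this layout might look like the following (the column names here are hypothetical stand-ins, and I'm using modern pandas method names for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column names matching the layout described above:
# one identifier column, a date column, and LHS/RHS value columns.
df = pd.DataFrame({
    "identifier": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(
        ["2001-01-31", "2001-02-28", "2001-03-31"] * 2
    ),
    "lhs": [0.5, 0.6, np.nan, 1.2, 1.1, 1.3],
    "rhs": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})
# Sort within identifier by date so expanding windows make sense.
df = df.sort_values(["identifier", "date"])
```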
Here is the function I am able to use with apply on the identifier-grouped object:
import numpy as np
import pandas

def cumulative_ols(
        data_frame,
        lhs_column,
        rhs_column,
        date_column,
        min_obs=60,
):
    beta_dict = {}
    for dt in data_frame[date_column].unique():
        # Expanding window: all rows up to and including this date.
        cur_df = data_frame[data_frame[date_column] <= dt]
        obs_count = cur_df[lhs_column].notnull().sum()
        if min_obs <= obs_count:
            beta = pandas.ols(
                y=cur_df[lhs_column],
                x=cur_df[rhs_column],
            ).beta.ix['x']
        else:
            beta = np.NaN
        beta_dict[dt] = beta
    beta_df = pandas.DataFrame(pandas.Series(beta_dict, name="FactorBeta"))
    beta_df.index.name = date_column
    return beta_df
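For reference, the call pattern through groupby/apply is roughly as below. This is a self-contained sketch: the column names are hypothetical, min_obs is lowered to fit the toy data, and NumPy's linalg.lstsq stands in for pandas.ols (which is not available in later pandas versions), so the function body here is a simplified analogue of the one above, not the original.

```python
import numpy as np
import pandas as pd

def cumulative_ols_demo(group, lhs_column, rhs_column, date_column, min_obs=2):
    """Loop-based expanding regression per group; np.linalg.lstsq stands in
    for pandas.ols here."""
    betas = {}
    for dt in group[date_column].unique():
        # Expanding window up to and including this date, complete cases only.
        cur = group[group[date_column] <= dt].dropna(subset=[lhs_column, rhs_column])
        if len(cur) >= min_obs:
            # Regress y on [1, x]; the slope is the second coefficient.
            X = np.column_stack([np.ones(len(cur)), cur[rhs_column].values])
            coef, _, _, _ = np.linalg.lstsq(X, cur[lhs_column].values, rcond=None)
            betas[dt] = coef[1]
        else:
            betas[dt] = np.nan
    out = pd.DataFrame(pd.Series(betas, name="FactorBeta"))
    out.index.name = date_column
    return out

# Hypothetical data; the same call pattern applies to the real frame.
df = pd.DataFrame({
    "identifier": ["A"] * 4,
    "date": pd.to_datetime(
        ["2001-01-31", "2001-02-28", "2001-03-31", "2001-04-30"]
    ),
    "lhs": [1.0, 2.0, 3.0, 4.0],
    "rhs": [1.0, 2.0, 3.0, 4.0],
})
result = df.groupby("identifier").apply(
    cumulative_ols_demo,
    lhs_column="lhs", rhs_column="rhs", date_column="date", min_obs=2,
)
```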
Following the advice in the comments, I created my own function that can be used with apply and which relies on cumsum to accumulate all the individual terms needed to express the coefficient from a univariate OLS regression vectorially.
def cumulative_ols(
        data_frame,
        lhs_column,
        rhs_column,
        date_column,
        min_obs=60,
):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work
    appropriately given sorted dates. It is also assumed that the date
    conventions of the left-hand-side and right-hand-side variables have been
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

        Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
        Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
        beta     ~ Cov(x,y) / Var(x)

    and the code makes a further simplification by cancelling one factor
    of (1/n).

    Notes: one easy improvement is to change the date column to a generic sort
    column, since there's no special reason the regressions need to be
    time-series specific.
    """
    # Running sums of the cross product, squares, and observation counts.
    data_frame["xy"] = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"] = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"] = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"] = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"] = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"] = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"] = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"] = data_frame["x2"].cumsum()
    data_frame["cumsum_x"] = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"] = data_frame[lhs_column].fillna(0.0).cumsum()
    # Expanding covariance and variance, with one factor of (1/n) cancelled.
    data_frame["cum_cov"] = (
        data_frame["cumsum_xy"]
        - (1.0 / data_frame["cum_yobs"]) * data_frame["cumsum_x"] * data_frame["cumsum_y"]
    )
    data_frame["cum_x_var"] = (
        data_frame["cumsum_x2"]
        - (1.0 / data_frame["cum_xobs"]) * data_frame["cumsum_x"]**2
    )
    data_frame["FactorBeta"] = data_frame["cum_cov"] / data_frame["cum_x_var"]
    data_frame["FactorBeta"][data_frame["cum_yobs"] < min_obs] = np.NaN
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols
I have verified on numerous test cases that this matches the output of my former function and the output of NumPy's linalg.lstsq function. I haven't done a full benchmark on the timing, but anecdotally, it is around 50 times faster in the cases I've been working on.
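The consistency check against linalg.lstsq can be sketched as follows, on synthetic data (the names and the data here are illustrative, not from the real project). The final expanding slope from the cumsum formulas should match the slope from a direct least-squares fit with an intercept on the full sample:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 200
x = rng.randn(n)
y = 0.7 * x + 0.1 * rng.randn(n)
df = pd.DataFrame({"x": x, "y": y})

# Expanding slope via running sums, mirroring the cumsum-based function:
# beta_t = (sum_xy - sum_x*sum_y/n) / (sum_x2 - sum_x^2/n) at each t.
n_obs = pd.Series(np.arange(1, n + 1), index=df.index)
sum_x = df["x"].cumsum()
sum_y = df["y"].cumsum()
sum_xy = (df["x"] * df["y"]).cumsum()
sum_x2 = (df["x"] ** 2).cumsum()
cum_cov = sum_xy - sum_x * sum_y / n_obs
cum_var = sum_x2 - sum_x ** 2 / n_obs
beta = cum_cov / cum_var

# Direct fit on the whole sample: regress y on [1, x]; the slope is the
# second coefficient.
X = np.column_stack([np.ones(n), x])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
```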