I'm trying to run a multi-variable regression and getting the error:
"ValueError: endog and exog matrices are different sizes"
My code snippet is below:
df_raw = pd.DataFrame(data=df_raw)
y = (df_raw['daily pct return']).astype(float)
x1 = (df_raw['Excess daily return']).astype(float)
x2 = (df_raw['Excess weekly return']).astype(float)
x3 = (df_raw['Excess monthly return']).astype(float)
x4 = (df_raw['Trading vol / mkt cap']).astype(float)
x5 = (df_raw['Std dev']).astype(float)
x6 = (df_raw['Residual risk']).astype(float)
y = y.replace([np.inf, -np.inf],np.nan).dropna()
print(y.shape)
print(x1.shape)
print(x2.shape)
print(x3.shape)
print(x4.shape)
print(x5.shape)
print(x6.shape)
df_raw.to_csv('Raw_final.csv', header=True)
result = smf.OLS(exog=y, endog=[x1, x2, x3, x4, x5, x6]).fit()
print(result.params)
print(result.summary())
As you can see from my code, I am checking the 'shape' of each variable. I get the following output which indicates the reason for the error is that the y variable has only 48392 values whereas all the others have 48393:
(48392,) (48393,) (48393,) (48393,) (48393,) (48393,) (48393,)
My dataframe looks something like the following:
daily pct return | Excess daily return | weekly pct return | index weekly pct return | Excess weekly return | monthly pct return | index monthly pct return | Excess monthly return | Trading vol / mkt cap | Std dev
------------------|---------------------|-------------------|-------------------------|----------------------|--------------------|--------------------------|-----------------------|-----------------------|-------------
| | | | | | | | 0.207582827 |
0.262658228 | 0.322397801 | | | | | | | 0.285585677 |
0.072681704 | 0.126445534 | | | | | | | 0.272920624 |
0.135514019 | 0.068778682 | | | | | | | 0.213149083 |
-0.115226337 | -0.173681889 | | | | | | | 0.155653699 |
-0.165116279 | -0.176569405 | | | | | | | 0.033925024 |
0.125348189 | 0.079889239 | | | | | | | 0.030968484 | 0.544133212
0.022277228 | -0.044949678 | | | | | | | 0.020735381 | 0.385659608
0.150121065 | 0.102119782 | | | | | | | 0.063563881 | 0.430868447
0.336842105 | 0.333590483 | | | | | | | 0.210193049 | 0.893734807
0.011023622 | -0.011860658 | 0.320987654 | -0.657089012 | 0.978076666 | | | | 0.100468109 | 1.137976483
0.37694704 | 0.308505907 | | | | | | | 0.135828281 | 1.867394416
Does anyone have an elegant solution to align the sizes of the matrices so I no longer receive this error? I think I need to drop the first row of values APART from the y variable ('daily pct return') but I'm uncertain how I can achieve this?
Thanks in advance!!
Finally got to the bottom of the problem! There were three issues:
1) The y variable had 48392 values whereas the other 6 variables all had 48393, because I had called dropna() on y alone. To fix this I added the following line of code to drop the first row (index label 0) from the whole dataframe:
df_raw = df_raw.drop([0])
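As an alternative to dropping a row by label, the mismatch in point 1 can be avoided by cleaning all regression columns together, so that any dropped row disappears from every variable at once. A minimal sketch on a small made-up frame (the values and the two-column subset are illustrative assumptions, not the original data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw; the values are invented for illustration.
df_demo = pd.DataFrame({
    "daily pct return": [np.nan, 0.26, np.inf, 0.13],
    "Excess daily return": [0.10, 0.32, 0.12, 0.07],
})

cols = ["daily pct return", "Excess daily return"]

# Treat +/-inf as missing, then drop every row that is missing in ANY column,
# so y and the regressors keep identical indexes and lengths.
clean = df_demo[cols].replace([np.inf, -np.inf], np.nan).dropna()

y = clean["daily pct return"]
x1 = clean["Excess daily return"]
print(y.shape, x1.shape)
```

Because both Series are taken from the same cleaned frame, their shapes cannot drift apart the way they did when only y was passed through dropna().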
2) My dataframe had lots of empty cells. You can't perform a regression unless every cell has a value in it. So I included some code to replace all infs and empty cells with NaN and then fill all NaNs with a 0 value. Code snippet:
df_raw['daily pct return'] = df_raw['daily pct return'].replace([np.inf, -np.inf], np.nan)
# '^\s*$' matches only cells that are empty or entirely whitespace
df_raw = df_raw.replace(r'^\s*$', np.nan, regex=True).replace('', np.nan)
df_raw.fillna(value=0, axis=1, inplace=True)
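The cleaning in point 2 can be tried on a small frame first. The sketch below uses invented values and assumes the blank cells arrive as empty or whitespace-only strings; it also shows dropna() as an alternative to filling with 0, since a filled 0 is treated by the regression as a real observation:

```python
import numpy as np
import pandas as pd

# Toy columns as they might look after loading data with blank cells (invented values).
raw = pd.DataFrame({
    "Std dev": ["0.54", "", "   ", "0.38"],
    "daily pct return": [0.12, np.inf, 0.02, -0.11],
})

# Blank or whitespace-only strings become NaN, everything is cast to float,
# and +/-inf is then treated as missing as well.
cleaned = (
    raw.replace(r"^\s*$", np.nan, regex=True)
       .astype(float)
       .replace([np.inf, -np.inf], np.nan)
)

filled = cleaned.fillna(0)   # the approach described above: missing -> 0
dropped = cleaned.dropna()   # alternative: discard incomplete rows

print(cleaned.isna().sum())
print(filled.shape, dropped.shape)
```

The pattern is anchored with ^ and $ so that only blank cells match; an unanchored '\s+' would also match any string that merely contains a space.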
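3) The way I'd written the multi-regression call was wrong: 'endog' is the dependent variable and 'exog' holds the regressors, so I had them the wrong way round. I switched to the formula interface, where y and x1 to x6 are the variables defined in my code: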
result = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6', data=df_raw).fit()
So in summary, my updated code is now as follows:
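import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df_raw = pd.DataFrame(data=df_raw)
df_raw = df_raw.drop([0])
df_raw['daily pct return'] = df_raw['daily pct return'].replace([np.inf, -np.inf], np.nan)
# '^\s*$' matches only cells that are empty or entirely whitespace
df_raw = df_raw.replace(r'^\s*$', np.nan, regex=True).replace('', np.nan)
df_raw.fillna(value=0, axis=1, inplace=True)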
df_raw.to_csv('Raw_final.csv', header=True)
# Define variables for regression
y = (df_raw['daily pct return']).astype(float)
x1 = (df_raw['Excess daily return']).astype(float)
x2 = (df_raw['Excess weekly return']).astype(float)
x3 = (df_raw['Excess monthly return']).astype(float)
x4 = (df_raw['Trading vol / mkt cap']).astype(float)
x5 = (df_raw['Std dev']).astype(float)
x6 = (df_raw['Residual risk']).astype(float)
# Check shape of variables to confirm they are of the same size
print(y.shape)
print(x1.shape)
print(x2.shape)
print(x3.shape)
print(x4.shape)
print(x5.shape)
print(x6.shape)
# Perform regression
result = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6', data=df_raw).fit()
print(result.params)
print(result.summary())