Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv
I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()
However, I find this R-like formula notation awkward and I'd like to use the usual pandas syntax:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
Using the second method I get the following error:
ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)
Why does this happen, and how can I fix it?
When using sm.OLS(y, X), y is the dependent variable and X holds the independent variables.
In the formula W ~ PTS + oppPTS, W is the dependent variable and PTS and oppPTS are the independent variables.
Therefore, use
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
instead of
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
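To see where the wording of that ValueError comes from, here is a minimal sketch of the shape arithmetic. It uses a made-up DataFrame in place of NBA_train.csv (the values are random; only the 835-row shape matches the question), and it calls np.dot directly. Exactly where statsmodels performs the equivalent product internally is an assumption on my part, since the question does not show the traceback.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for NBA_train.csv: random values, same number of rows.
rng = np.random.default_rng(0)
n = 835
NBA = pd.DataFrame({
    "W": rng.integers(10, 70, size=n),
    "PTS": rng.integers(7000, 10000, size=n),
    "oppPTS": rng.integers(7000, 10000, size=n),
})

# Swapped roles, as in the question: W plus a constant column ends up as "X",
# and the two predictor columns end up as "y". Both are (835, 2).
X_swapped = np.column_stack([np.ones(n), NBA["W"]])
y_swapped = NBA[["PTS", "oppPTS"]].to_numpy()

try:
    # Inner dimensions are 2 and 835, so the product is undefined.
    np.dot(y_swapped, X_swapped)
    message = None
except ValueError as err:
    message = str(err)

print(X_swapped.shape, y_swapped.shape)
print(message)

# Correct roles: y is one column of length 835, X is (835, 3) with the constant.
y_ok = NBA["W"].to_numpy()
X_ok = np.column_stack([np.ones(n), NBA["PTS"], NBA["oppPTS"]])
print(y_ok.shape, X_ok.shape)
```

With the roles the right way round, y is a single column and X has one column per predictor plus the constant, which is the layout sm.OLS(y, X) expects.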
The corrected script in full:
import pandas as pd
import statsmodels.api as sm
NBA = pd.read_csv("NBA_train.csv")
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
yields
                            OLS Regression Results
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.