Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv
I know how to fit a multiple linear regression model to these data using statsmodels.formula.api:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()
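This works, and the fitted results object also exposes the estimates directly, e.g.:
print(model.params)     # intercept plus the PTS and oppPTS coefficients
print(model.rsquared)   # R-squared of the fit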
However, I find this R-like formula notation awkward, and I'd like to use the usual pandas syntax instead:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")    
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
Using the second method, I get the following error:
ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)
Why does this happen, and how can I fix it?
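A quick check of what I am actually passing in shows that both objects given to sm.OLS are (835, 2) once the constant is added to X:
print(X.shape, y.shape)
# (835, 2) (835, 2)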
When using sm.OLS(y, X), y is the dependent variable and X contains the independent variables. In the formula W ~ PTS + oppPTS, W is the dependent variable and PTS and oppPTS are the independent variables.
Your second snippet swaps the two: the two-column frame NBA[['PTS', 'oppPTS']] is passed as the response and W (plus the added constant) as the predictors, so statsmodels ends up attempting a matrix product between two (835, 2) arrays, which is exactly the mismatch reported in the error.
Therefore, use
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
instead of
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
In full:
import pandas as pd
import statsmodels.api as sm
NBA = pd.read_csv("NBA_train.csv")    
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
yields
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
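This reproduces the same fit as the formula interface. As a rough follow-up sketch (the new_games frame below is just a made-up illustration), the fitted results object can then be used in the usual ways:
# Coefficients as a pandas Series: const, PTS, oppPTS
print(model11.params)

# In-sample fitted values (predicted wins for the teams in NBA_train.csv)
fitted = model11.predict(X)

# Prediction for hypothetical season totals; the constant column has to be
# added explicitly (has_constant='add' forces it even for a single-row frame)
new_games = pd.DataFrame({'PTS': [8000], 'oppPTS': [7800]})
print(model11.predict(sm.add_constant(new_games, has_constant='add')))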