
Multiple linear regression in pandas statsmodels: ValueError

Tags: python, pandas

Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv

I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()

However, I find this R-like formula notation awkward and I'd like to use the usual pandas syntax:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")    
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

Using the second method I get the following error:

ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)

Why does this happen, and how can I fix it?

alkamid asked Mar 21 '15 18:03



1 Answer

In sm.OLS(y, X), the first argument y is the dependent variable and the second argument X holds the independent variables.

In the formula W ~ PTS + oppPTS, W is the dependent variable and PTS and oppPTS are the independent variables.

Therefore, use

y = NBA['W']
X = NBA[['PTS', 'oppPTS']]

instead of

X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
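As for why the swapped version fails with that particular message: with the roles reversed, y has shape (835, 2) and X (one column plus the constant) also has shape (835, 2), so a matrix product between two arrays of that shape cannot line up. The sketch below reproduces the same kind of error with plain NumPy. The all-ones arrays are placeholders, not the NBA data, and I have not traced which internal product statsmodels actually attempts.

```python
import numpy as np

n = 835  # number of rows in the question's data

# Placeholder arrays with the shapes from the swapped call (not the real data):
swapped_y = np.ones((n, 2))  # stands in for NBA[['PTS', 'oppPTS']] passed as y
swapped_X = np.ones((n, 2))  # stands in for add_constant(NBA['W'])

try:
    # Inner dimensions are 2 and 835, so this product is undefined.
    np.dot(swapped_y, swapped_X)
    message = None
except ValueError as err:
    message = str(err)
print(message)

# With the roles the right way round, y is one-dimensional and X has
# three columns (constant, PTS, oppPTS), so X.T @ y is well defined.
y = np.ones(n)
X = np.ones((n, 3))
print(np.dot(X.T, y).shape)
```

The printed message should have the same "shapes ... not aligned" form as the one in the question.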

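The full corrected script

import pandas as pd
import statsmodels.api as sm

NBA = pd.read_csv("NBA_train.csv")
y = NBA['W']                  # dependent variable (wins)
X = NBA[['PTS', 'oppPTS']]    # independent variables
X = sm.add_constant(X)        # add the intercept column; sm.OLS does not add one itself
model11 = sm.OLS(y, X).fit()
model11.summary()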

yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
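
If it helps to see what sm.add_constant and the OLS fit amount to, here is a minimal sketch in plain NumPy. It uses synthetic, noise-free data rather than NBA_train.csv; the names pts, opp_pts, wins and the "true" coefficients are all made up for illustration. The only points it demonstrates are that the constant is just a column of ones in the design matrix and that the dependent variable goes on the other side of the least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors on roughly the scale of season point totals.
pts = rng.uniform(7000, 9500, size=n)
opp_pts = rng.uniform(7000, 9500, size=n)

# Hypothetical coefficients: intercept, PTS slope, oppPTS slope.
true_coefs = np.array([41.0, 0.03, -0.03])
wins = true_coefs[0] + true_coefs[1] * pts + true_coefs[2] * opp_pts

# Equivalent of sm.add_constant: a leading column of ones.
X = np.column_stack([np.ones(n), pts, opp_pts])

# Ordinary least squares: find coefs minimizing ||X @ coefs - wins||^2.
coefs, residuals, rank, _ = np.linalg.lstsq(X, wins, rcond=None)
print(coefs)
```

Because the synthetic data has no noise, the recovered coefficients should match the made-up ones up to floating-point error. On real data, sm.OLS additionally gives you the standard errors and diagnostics shown in the summary above.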

unutbu answered Nov 10 '22 19:11