Multiple linear regression in pandas statsmodels: ValueError

Tags: python, pandas

Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv

I know how to fit these data to a multiple linear regression model using statsmodels.formula.api:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()

However, I find this R-like formula notation awkward and I'd like to use the usual pandas syntax:

import pandas as pd
NBA = pd.read_csv("NBA_train.csv")    
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

Using the second method I get the following error:

ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)

Why does this happen, and how can I fix it?


asked Mar 21 '15 18:03

alkamid

1 Answer

In sm.OLS(y, X), the first argument y is the dependent variable (statsmodels calls it endog) and the second argument X holds the independent variables (exog). Your code has the two swapped.

In the formula W ~ PTS + oppPTS, W is the dependent variable and PTS and oppPTS are the independent variables.

Therefore, use

y = NBA['W']
X = NBA[['PTS', 'oppPTS']]

instead of

X = NBA['W']
y = NBA[['PTS', 'oppPTS']]

The complete corrected script

import pandas as pd
import statsmodels.api as sm

NBA = pd.read_csv("NBA_train.csv")    
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()

yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.


answered Nov 10 '22 19:11

unutbu
