Statsmodel Multiple Linear Regression Error - Python

Question

I am running (what I think is) as fairly straightforward multiple linear regression model fit using Stats model.

My code is as follows:

y = 'EXITS|20:00:00'
all_columns = "+".join(y_2015piv.columns - ['EXITS|20:00:00'])
reg_formula = "y~" + all_columns

lm= smf.ols(formula=reg_formula, data=y_2015piv).fit()

Because I have about 30 factor variables I'm creating the formula using Python string manipulation. "y" is as presented above. all_columns is the dataframe y_2015piv columns without "y".

This is all_columns:

DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

The values in the dataframe are continuous numerical variables and 0/1 dummy variables.

When I try and fit the model I get this error:

PatsyError: numbers besides '0' and '1' are only allowed with **
    y~DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

There is nothing on line that addresses what this could be. Any help appreciated.

By the way, when I fit this model in Scikit-learn it works fine. So I figure the data is in order.

Thanks in advance.

Learner · Accepted Answer

The first error that I got was this:

PatsyError: numbers besides '0' and '1' are only allowed with **
Temp ~ MEI+ CO2+ CH4+ N2O+ CFC-11+ CFC-12+ TSI+ Aerosols
                               ^^

According to this link: http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q you can use Q("var") in the formula to get rid of the error. I was getting the same error but it was solved.

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11")+ Q("CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

this is the solved line of code. I had tried

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11 + CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

but this did not work. It seems that when using formula, the numbers and variables happen to have certain meaning that does not let the use of certain names. in my case error was:

PatsyError: Error evaluating factor: NameError: no data named 'CFC-11+ CFC-12' found
Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols
                           ^^^^^^^^^^^^^^^^^^^

Josef · Answer

patsy is handling the formula parsing and is parsing the string and interpreting it as formula with the given syntax. So some elements in the string are not allowed because they are part of the formula syntax. To keep them as names, patsy also has a code for taking the names as literal text Q which should work in this case http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q

Otherwise, if you already have the full design matrix with all the dummy variables, then there is no reason to go through the formula interface. Using the direct interface with pandas DataFrames or numpy arrays:

sm.OLS(y, x)

will ignore any names of DataFrame columns except for using it as strings in the summary table. Variable/column names are also used as one way of defining restrictions for t_test but those go also through patsy and I am not sure it works with special characters in the names.

Statsmodel Multiple Linear Regression Error - Python

Tags:

python

linear-regression

statsmodels

Windstorm1981

2 Answers

Learner

Josef

Recent Activity

Donate For Us

Statsmodel Multiple Linear Regression Error - Python

Tags:

python

linear-regression

statsmodels

Windstorm1981

2 Answers

Learner

Josef

Related questions

Recent Activity

Donate For Us