I am fitting what I think is a fairly straightforward multiple linear regression model using statsmodels.
My code is as follows:
y = 'EXITS|20:00:00'
all_columns = "+".join(y_2015piv.columns - ['EXITS|20:00:00'])
reg_formula = "y~" + all_columns
lm= smf.ols(formula=reg_formula, data=y_2015piv).fit()
Because I have about 30 factor variables, I'm creating the formula using Python string manipulation. "y" is as presented above. all_columns is the column names of the dataframe y_2015piv, joined with "+", without the "y" column.
This is all_columns:
DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep
The values in the dataframe are continuous numerical variables and 0/1 dummy variables.
When I try to fit the model I get this error:
PatsyError: numbers besides '0' and '1' are only allowed with **
y~DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep
I can't find anything online that addresses what this could be. Any help is appreciated.
By the way, when I fit this model in scikit-learn it works fine, so I figure the data is in order.
Thanks in advance.
The first error that I got was this:
PatsyError: numbers besides '0' and '1' are only allowed with **
Temp ~ MEI+ CO2+ CH4+ N2O+ CFC-11+ CFC-12+ TSI+ Aerosols
^^
According to the patsy documentation (http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q), you can wrap a name as Q("var") in the formula to get rid of the error. I was getting the same error, and this solved it.
linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11")+ Q("CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()
This is the corrected line of code. I had first tried
linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11 + CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()
but this did not work. In a formula, numbers and operators such as - and + have their own meaning, so names containing them cannot be used directly, and Q() quotes exactly one name at a time. In my case the error was:
PatsyError: Error evaluating factor: NameError: no data named 'CFC-11+ CFC-12' found
Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols
^^^^^^^^^^^^^^^^^^^
patsy handles the formula parsing: it parses the string and interprets it as a formula according to its own syntax. Some characters are therefore not allowed in names, because they are part of the formula syntax. To keep them as names, patsy provides Q, which takes a name as literal text and should work in this case:
http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q
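Applied to the question's string-building approach, one way to do this is to wrap every column name in Q("...") while joining. This is only a minimal sketch: the helper name quoted_formula and the short column list are made up for illustration, and it assumes no column name contains a double quote.

```python
def quoted_formula(response, columns):
    """Build a patsy formula string with every name wrapped in Q("...")."""
    # Every column except the response becomes a predictor.
    predictors = [c for c in columns if c != response]
    rhs = " + ".join('Q("{}")'.format(c) for c in predictors)
    return 'Q("{}") ~ {}'.format(response, rhs)


# Hypothetical subset of the column names shown in the question.
columns = ["DAY_Fri", "ENTRIES|00:00:00", "EXITS|16:00:00", "EXITS|20:00:00"]
reg_formula = quoted_formula("EXITS|20:00:00", columns)
print(reg_formula)
```

The resulting string could then be passed to smf.ols(formula=reg_formula, data=y_2015piv) as in the question. Note that this sketch puts the response column name itself, quoted, on the left-hand side instead of the Python variable y.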
Otherwise, if you already have the full design matrix with all the dummy variables, then there is no reason to go through the formula interface. Using the direct interface with pandas DataFrames or numpy arrays:
sm.OLS(y, x)
will ignore the names of the DataFrame columns, except for using them as labels in the summary table. Variable/column names are also one way of defining restrictions for t_test, but those also go through patsy, and I am not sure that works with special characters in the names.