I am having a lot of difficulty using the statsmodels.formula.api function
ols(formula,data).fit().rsquared_adj
due to the nature of the names of my predictors. The predictors have numbers, spaces, etc. in them, which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:
Q("weight.in.kg")
So I need to build my formula from a list, and the difficulty arises in wrapping every item in the list with this patsy.builtins.Q
formula = "{} ~ {} + 1".format(response, ' + '.join(candidate))
with candidate being my list of predictors.
My question to you, dearest Python experts, is how on earth do I put every individual item in the list candidate within the quotes in the following expression:
Q('')
so that the ols function can actually read it? Apologies if this is super obvious; I'm not very good at Python.
First, when calling OLS() directly we need to add the constant ourselves (statsmodels provides add_constant() for this). The constant is the intercept: it accounts for an offset that is the same for all observations.
The OLS() function of the statsmodels.api module is used to perform OLS regression. It returns an OLS object. The fit() method is then called on this object to fit the regression line to the data, and the summary() method returns a table that describes the regression results in detail.
The OLS class is initialized as OLS(y, X). It takes two array-like objects as input: y and X. In general, X will be either a numpy array or a pandas data frame with shape (n, p), where n is the number of data points and p is the number of predictors.
Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):
In [7]: from patsy import ModelDesc
In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]:
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
rhs_termlist=[Term([]),
Term([EvalFactor('x1')]),
Term([EvalFactor('x2')]),
Term([EvalFactor('x3')])])
This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
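To make the point about interactions concrete, here is a small sketch that parses a formula containing an interaction and counts the factors in each right-hand-side term. The formula and the names desc and factor_counts are made up for illustration:

```python
from patsy import ModelDesc

# x1:x2 is an interaction, so its Term should carry two factors.
desc = ModelDesc.from_formula("y ~ x1 + x1:x2")

factor_counts = []
for term in desc.rhs_termlist:
    names = [factor.name() for factor in term.factors]
    factor_counts.append(len(names))
    # The intercept shows up as a term with an empty factor list.
    print(len(names), names)
```

The intercept term has zero factors, the plain x1 term has one, and the interaction has two.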
So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:
from patsy import ModelDesc, Term, LookupFactor
response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)
and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:
ols(model_desc, data).fit().rsquared_adj
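For a self-contained check that awkward column names survive this route, here is a sketch with a made-up DataFrame (the column names and values are invented for illustration). It hands the ModelDesc to patsy's own dmatrices function, so that the design matrices can be inspected directly:

```python
import pandas as pd
from patsy import LookupFactor, ModelDesc, Term, dmatrices

# Made-up data with column names that are not valid Python identifiers.
data = pd.DataFrame({
    "y": [1.0, 2.1, 2.9, 4.2, 5.1],
    "weight.in.kg": [60.0, 72.5, 81.0, 90.2, 55.3],
    "height in cm": [160.0, 175.0, 182.0, 190.0, 158.0],
    "2nd score": [3.0, 1.0, 4.0, 1.0, 5.0],
})
response = "y"
candidates = [c for c in data.columns if c != response]

model_desc = ModelDesc(
    [Term([LookupFactor(response)])],
    # Intercept first, then one term per candidate predictor.
    [Term([])] + [Term([LookupFactor(c)]) for c in candidates],
)

# Build the response and design matrices without any formula string.
y, X = dmatrices(model_desc, data, return_type="dataframe")
print(list(X.columns))
```

The column labels of X are the raw column names, with no Q(...) wrapper around them.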
There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
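The two factor types can also be mixed in one model description. The sketch below, using a made-up two-column DataFrame, puts a transformed predictor through EvalFactor and an awkwardly named one through LookupFactor:

```python
import numpy as np
import pandas as pd
from patsy import EvalFactor, LookupFactor, ModelDesc, Term, dmatrix

# Made-up data for illustration.
data = pd.DataFrame({
    "x1": [1.0, 2.0, 4.0],
    "weight.in.kg": [60.0, 72.5, 81.0],
})

desc = ModelDesc(
    [],  # no left-hand side: we only build a predictor matrix here
    [
        Term([EvalFactor("np.log(x1)")]),       # evaluated as Python code
        Term([LookupFactor("weight.in.kg")]),   # plain lookup by column name
    ],
)

# np is found in the calling namespace when the EvalFactor is evaluated.
X = dmatrix(desc, data, return_type="dataframe")
print(X)
```

No intercept term is listed here, so the resulting matrix has exactly the two requested columns.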
Alternatively, you could do this with some fancier Python string processing, like:
quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))
But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your predictor names contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
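If you do stay with the string approach, one way to reduce the quoting problem is to let Python produce the string literal with repr-style formatting ({!r}) instead of hard-coding the quotes. This is only a sketch with made-up names, and it is an assumption on my part that this is sufficient for your inputs. It handles embedded quote characters, but the term-based approach above is still the more robust option:

```python
# Hypothetical predictor names, one containing a quote character.
response = "y"
candidates = ["weight.in.kg", "it's odd"]

# Naive quoting produces a broken Python expression for the second name.
naive = ["Q('{}')".format(c) for c in candidates]

# {!r} inserts repr(c), which is always a valid Python string literal.
quoted = ["Q({!r})".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, " + ".join(quoted))

print(naive[1])
print(formula)
```

The naive version yields Q('it's odd'), which is not valid Python, while the repr version switches to double quotes for that name.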