Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using ols function with parameters that contain numbers/spaces

I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj 

due to the nature of the names of my predictors. The predictors have numbers and spaces etc in them which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q So lets say my predictor would be weight.in.kg , it should be entered as follows:

Q("weight.in.kg")

so I need to take my formula from a list, and the difficulty arises in modifying every item in the list with this patsy.builtin.Q

formula = "{} ~ {} + 1".format(response, ' + '.join([candidate])

with [candidate] being my list of predictors.

My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.

like image 239
Thomas Avatar asked Jul 01 '16 15:07

Thomas


People also ask

Why do we add constant in Statsmodel?

First, we always need to add the constant. The reason for this is that it takes care of the bias in the data (a constant difference which is there for all observations).

What is OLS Statsmodel?

The OLS() function of the statsmodels. api module is used to perform OLS regression. It returns an OLS object. Then fit() method is called on this object for fitting the regression line to the data. The summary() method is used to obtain a table which gives an extensive description about the regression results.

How do I use OLS in Python?

OLS class and and its initialization OLS(y, X) method. This method takes as an input two array-like objects: X and y . In general, X will either be a numpy array or a pandas data frame with shape (n, p) where n is the number of data points and p is the number of predictors.


1 Answers

Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.

So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with name like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.

Reference:

  • https://patsy.readthedocs.io/en/latest/expert-model-specification.html
  • https://patsy.readthedocs.io/en/latest/formulas.html
like image 197
Nathaniel J. Smith Avatar answered Oct 31 '22 17:10

Nathaniel J. Smith