Using ols function with parameters that contain numbers/spaces

Tags: python, list, pandas, charts, patsy

I am having a lot of difficulty using the statsmodels.formula.api function

       ols(formula,data).fit().rsquared_adj

due to the names of my predictors. The predictors have numbers, spaces, etc. in them, which it clearly doesn't like. I understand that I need to use something like patsy.builtins.Q. So let's say my predictor is weight.in.kg; it should be entered as follows:

Q("weight.in.kg")

So I need to build my formula from a list, and the difficulty arises in wrapping every item in the list with patsy.builtins.Q:

formula = "{} ~ {} + 1".format(response, ' + '.join([candidate]))

with [candidate] being my list of predictors.

My question to you, dearest python experts, is how on earth do I put every individual item in the list [candidate] within the quotes in the following expression:

Q('')

so that the ols function can actually read it? Apologies if this is super obvious, me no good at python.


asked Jul 01 '16 15:07

Thomas

1 Answer

Right now you're starting with a list of terms that you want in your formula, then trying to paste them together into a complicated string, which patsy will parse and convert back into a list of terms. You can see the data structure that patsy generates for this kind of formula (ModelDesc.from_formula is patsy's parser):

In [7]: from patsy import ModelDesc

In [8]: ModelDesc.from_formula("y ~ x1 + x2 + x3")
Out[8]: 
ModelDesc(lhs_termlist=[Term([EvalFactor('y')])],
          rhs_termlist=[Term([]),
                        Term([EvalFactor('x1')]),
                        Term([EvalFactor('x2')]),
                        Term([EvalFactor('x3')])])

This might look a little intimidating, but it's pretty simple really -- you have a ModelDesc, which represents a single formula, and it has a left-hand-side list of terms and a right-hand-side list of terms. Each term is represented by a Term object, and each Term has a list of factors. (Here each term just has a single factor -- if you had any interactions then those terms would have multiple factors.) Also, the "empty interaction" Term([]) is how patsy represents the intercept term.
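To make the term/factor structure concrete, here is a small sketch that parses a formula with an interaction and inspects the result. The formula and the variable names (y, x1, x2) are made up for illustration; nothing is evaluated against real data here, since ModelDesc.from_formula only parses.

```python
from patsy import ModelDesc, Term, EvalFactor

# Parse a formula with two main effects and one interaction (hypothetical names).
desc = ModelDesc.from_formula("y ~ x1 + x2 + x1:x2")

# Number of factors in each right-hand-side term:
# 0 for the intercept, 1 for each main effect, 2 for the interaction.
factor_counts = sorted(len(term.factors) for term in desc.rhs_termlist)

# The intercept is the "empty interaction" term.
has_intercept = Term([]) in desc.rhs_termlist

# The interaction is a single Term holding both factors.
interaction = Term([EvalFactor("x1"), EvalFactor("x2")])
has_interaction = interaction in desc.rhs_termlist

print(factor_counts, has_intercept, has_interaction)
```

Terms compare equal when they contain the same set of factors, which is why the membership checks work without caring about factor order.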

So you can avoid all this complicated quoting/parsing stuff by directly creating the terms you want and passing them to patsy, skipping the string parsing step:

from patsy import ModelDesc, Term, LookupFactor

response_terms = [Term([LookupFactor(response)])]
# start with intercept...
model_terms = [Term([])]
# ...then add another term for each candidate
model_terms += [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

and now you can pass that model_desc object into any function where you'd normally pass a patsy formula:

ols(model_desc, data).fit().rsquared_adj
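If you want to check what patsy builds from such a model_desc before fitting anything, one option is to call patsy's dmatrices on it directly. The sketch below assumes a small made-up dataset (the column names and values are invented for illustration) stored as a dict of NumPy arrays; a pandas DataFrame should work the same way.

```python
import numpy as np
from patsy import ModelDesc, Term, LookupFactor, dmatrices

# Hypothetical data with awkward column names (values are invented).
data = {
    "body fat %": np.array([18.0, 22.5, 25.1, 30.2]),
    "weight.in.kg": np.array([60.0, 72.5, 81.0, 95.3]),
    "height (cm)": np.array([165.0, 178.0, 182.0, 171.0]),
}
response = "body fat %"
candidates = ["weight.in.kg", "height (cm)"]

# Same construction as in the answer: intercept plus one term per candidate.
response_terms = [Term([LookupFactor(response)])]
model_terms = [Term([])] + [Term([LookupFactor(c)]) for c in candidates]
model_desc = ModelDesc(response_terms, model_terms)

# Build the design matrices without going through a formula string.
y, X = dmatrices(model_desc, data)
column_names = X.design_info.column_names
print(column_names)
```

The column names come straight from the variable names, with no Q(...) wrappers and no quoting involved.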

There's another trick here: you'll notice that the first example has EvalFactor objects, and now we're using LookupFactor objects instead. The difference is that EvalFactor takes a string of arbitrary Python code, which is nice if you want to write something like np.log(x1), but really annoying if you have variables with names like weight.in.kg. LookupFactor directly takes the name of a variable to look up in your data, so no further quoting is needed.
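The two factor types can be mixed in one model. As a rough sketch (the data values are invented, and x1 is a hypothetical extra column), this uses EvalFactor for a transformed variable and LookupFactor for an awkwardly named one:

```python
import numpy as np
from patsy import ModelDesc, Term, EvalFactor, LookupFactor, dmatrix

# Hypothetical data: x1 is chosen so that log(x1) is 0, 1, 2.
data = {
    "x1": np.array([1.0, np.e, np.e ** 2]),
    "weight.in.kg": np.array([60.0, 72.5, 81.0]),
}

# EvalFactor: the string is evaluated as Python code (np must be in scope).
# LookupFactor: the string is used as a plain key into the data.
desc = ModelDesc(
    [],  # no left-hand side, so dmatrix (not dmatrices) is used
    [Term([EvalFactor("np.log(x1)")]), Term([LookupFactor("weight.in.kg")])],
)

X = dmatrix(desc, data)
names = X.design_info.column_names
log_column = np.asarray(X)[:, names.index("np.log(x1)")]
print(names)
```

No intercept appears here because Term([]) was deliberately left out of the term list.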

Alternatively, you could do this with some fancier Python string processing, like:

quoted = ["Q('{}')".format(c) for c in candidates]
formula = "{} ~ {} + 1".format(response, ' + '.join(quoted))

But while this is a bit simpler to start with, it's much more fragile -- for example, think about (or try) what happens if one of your parameters contains a quote character! You should never write something like this in a processing pipeline where the candidate names come from somewhere else that you can't control (e.g. a random CSV file) -- you could get all kinds of arbitrary code execution. The solution above avoids all of these problems.
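To see the fragility without running any model, here is a standard-library-only sketch. The helper names (naive_quote, is_valid_python, called_names) and the example column names are hypothetical; ast.parse is used only as a stand-in for "does this parse as Python code", which is roughly what a formula parser has to do with each factor.

```python
import ast

def naive_quote(name):
    # The string-pasting approach: wrap the raw name in Q('...').
    return "Q('{}')".format(name)

def is_valid_python(expr):
    # True if expr parses as a single Python expression.
    try:
        ast.parse(expr, mode="eval")
        return True
    except SyntaxError:
        return False

def called_names(expr):
    # Names of all functions called directly inside expr.
    tree = ast.parse(expr, mode="eval")
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

plain = naive_quote("weight.in.kg")        # Q('weight.in.kg')
broken = naive_quote("father's height")    # quote in the name ends the string early
injected = naive_quote("x') + some_function() + Q('y")

print(is_valid_python(plain))
print(is_valid_python(broken))
print(called_names(injected))
```

The last case is the worrying one: the name closes the quote itself, so the resulting expression is valid code that calls a function other than Q.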

Reference:

https://patsy.readthedocs.io/en/latest/expert-model-specification.html
https://patsy.readthedocs.io/en/latest/formulas.html


answered Oct 31 '22 17:10

Nathaniel J. Smith
