Consider the following example:
import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})
Then
smf.ols('a ~ b', df)
smf.ols('177sdays ~ b', df2)
The first call works and the second does not. The only difference seems to be the presence of numeric characters in the variable name. Why is this?
Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. From the docs, an expression of the form:
y ~ a + a:b + np.log(x)
will construct a patsy object of the form:
ModelDesc([Term([EvalFactor("y")])],
[Term([]),
Term([EvalFactor("a")]),
Term([EvalFactor("a"), EvalFactor("b")]),
Term([EvalFactor("np.log(x)")])])
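EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers, i.e. they may contain only "the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9." The name 177sdays starts with a digit, so it is not a valid identifier and the formula fails to parse as Python code.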
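To check column names against this rule before building a formula, something like the following sketch may help. It uses only the standard library; the helper name is_plain_formula_name is made up for illustration, and the extra keyword check is an assumption on my part (a name such as "class" is an identifier lexically but cannot be used as a variable in Python code).

```python
import keyword


def is_plain_formula_name(name: str) -> bool:
    """Return True if `name` can be written as a bare Python variable name.

    Hypothetical helper: it only checks the identifier rule discussed above,
    it does not call patsy or statsmodels.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


# Column names from the question plus a few extra illustrative ones.
for column in ["a", "b", "177sdays", "days_177", "my col", "class"]:
    print(column, is_plain_formula_name(column))
```

Names for which this returns False are the ones that would need quoting or renaming before being used in a formula string.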
As @Josef stated, one can use patsy's Q to quote the variable:
smf.ols('Q("177sdays") ~ b', df2).fit()
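If many columns have awkward names, the quoting can be automated when the formula string is assembled. The following is only a sketch: quote_term and build_formula are hypothetical helper names, not part of patsy or statsmodels, and the code just produces a string that you would then pass to smf.ols yourself.

```python
import keyword
from typing import Sequence


def quote_term(name: str) -> str:
    """Wrap `name` in Q("...") unless it is already a plain identifier."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Escape backslashes and double quotes so the quoted literal stays valid.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'Q("{escaped}")'


def build_formula(response: str, predictors: Sequence[str]) -> str:
    """Assemble 'response ~ p1 + p2 + ...' with quoting applied where needed."""
    right_hand_side = " + ".join(quote_term(p) for p in predictors)
    return f"{quote_term(response)} ~ {right_hand_side}"


formula = build_formula("177sdays", ["b"])
print(formula)  # Q("177sdays") ~ b
```

The resulting string matches the formula used above, so it could be passed as smf.ols(formula, df2). Renaming the column to a valid identifier (for example with DataFrame.rename) is another option if you prefer plain names in the model summary.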