Numbers as variable names not recognized by statsmodels.formula.api

Question

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

Then run:

smf.ols('a ~ b', df)
smf.ols('177sdays ~ b', df2)

The first works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?

juanpa.arrivillaga · Accepted Answer

Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. From the docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object of the form:

ModelDesc([Term([EvalFactor("y")])],
      [Term([]),
       Term([EvalFactor("a")]),
       Term([EvalFactor("a"), EvalFactor("b")]),
       Term([EvalFactor("np.log(x)")])])

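EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers, i.e. made up of the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. The name 177sdays starts with a digit, so it is not a valid identifier, which is why the second formula fails.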
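To see which column names can appear bare in a formula, a quick check with the standard library may help. This is only a sketch: the helper name is_plain_formula_name is made up for illustration and is not part of patsy or statsmodels, and it assumes that "valid Python identifier and not a reserved word" is the right criterion for a bare name.

```python
import keyword


def is_plain_formula_name(name: str) -> bool:
    """Hypothetical helper: True if `name` looks safe to use bare in a formula.

    Assumption: a bare name must be a valid Python identifier and must not
    be a reserved keyword such as `class` or `lambda`.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


for column in ["a", "b", "177sdays", "sdays_177", "class"]:
    print(column, is_plain_formula_name(column))
```

Here 'a', 'b' and 'sdays_177' pass the check, while '177sdays' (leading digit) and 'class' (keyword) do not.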

Jean Paul · Answer

As @Josef stated, one can use patsy's Q to quote the variable name:

smf.ols('Q("177sdays") ~ b', df2).fit()
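If the column names are not known in advance, the quoting can be applied programmatically when building the formula string. The following is a minimal sketch under stated assumptions: quote_name and build_formula are hypothetical helpers written for this example (they are not patsy or statsmodels functions), and only Q itself comes from patsy.

```python
import keyword


def quote_name(name: str) -> str:
    """Hypothetical helper: wrap `name` in patsy's Q(...) only when needed."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # repr() yields a properly escaped Python string literal for the name.
    return f"Q({name!r})"


def build_formula(response: str, predictors: list[str]) -> str:
    """Hypothetical helper: assemble 'response ~ x1 + x2 + ...'."""
    right_hand_side = " + ".join(quote_name(p) for p in predictors)
    return f"{quote_name(response)} ~ {right_hand_side}"


formula = build_formula("177sdays", ["b"])
print(formula)  # Q('177sdays') ~ b
```

The resulting string can then be passed on, e.g. smf.ols(formula, df2).fit(). Another option is to rename the column before fitting, for example df2.rename(columns={'177sdays': 'sdays_177'}), so that no quoting is needed.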

Tags: python, pandas, statsmodels

user7147790