
Numbers as variable names not recognized by statsmodels.formula.api

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

Then call:

smf.ols('a ~ b', df)
smf.ols('177sdays ~ b', df2)

The first call works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?

asked Nov 23 '16 01:11 by user7147790


2 Answers

Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. From the docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object of the form:

ModelDesc([Term([EvalFactor("y")])],
      [Term([]),
       Term([EvalFactor("a")]),
       Term([EvalFactor("a"), EvalFactor("b")]),
       Term([EvalFactor("np.log(x)")])])

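EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers: they may contain the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. The name 177sdays begins with a digit, so it cannot be parsed as a variable name, whereas a name such as sdays177 would be accepted.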
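As a rough check of which column names are likely to cause trouble in a formula, Python's built-in str.isidentifier() can be used. This is only a sketch: patsy does its own tokenizing, so the check approximates its behaviour, and the helper name is_formula_safe is hypothetical.

```python
import keyword


def is_formula_safe(name):
    """Return True if `name` is a valid Python identifier and not a reserved word.

    Hypothetical helper: it approximates what a formula parser that
    evaluates terms as Python code is likely to accept as a bare name.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


# Column names from the question, plus two variants that avoid a leading digit.
for column in ["a", "b", "177sdays", "sdays177", "_177sdays"]:
    print(column, is_formula_safe(column))
```

Only 177sdays fails the check here, because it is the only name that starts with a digit.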

answered Oct 09 '22 05:10 by juanpa.arrivillaga


As @Josef stated, one can use patsy's Q function to quote the variable name:

smf.ols('Q("177sdays") ~ b', df2).fit()
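If quoting every term becomes tedious, another option is to rename the columns to valid identifiers before building the formula. The following is a minimal sketch, assuming that replacing unsupported characters with underscores is acceptable for your data; to_identifier is a hypothetical helper, not part of pandas, patsy, or statsmodels, and it does not handle reserved words such as class.

```python
import re


def to_identifier(name):
    """Turn an arbitrary column label into a valid Python identifier.

    Hypothetical helper: characters other than letters, digits and
    underscores become underscores, and a leading digit gets an
    underscore prefix.
    """
    cleaned = re.sub(r"\W", "_", str(name))
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


# Map each original label to its sanitized form.
columns = ["177sdays", "b", "net income"]
renamed = {old: to_identifier(old) for old in columns}
print(renamed)
```

The resulting mapping could then be passed to df2.rename(columns=renamed), after which the formula would refer to _177sdays directly instead of Q("177sdays").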

answered Oct 09 '22 05:10 by Jean Paul