Statsmodels: ols writing Formula with unknown column names

Question

I am trying to run ANOVA using statsmodels. In a loop, I fit one model per column (each a categorical feature) of my dataframe against a single column, 'imp', as follows.

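import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

for cat_feature in df.columns.drop('imp'):  # skip the response column itself
  data_model = pd.DataFrame({
    'x': df[cat_feature],
    'y': df['imp']})
  model = smf.ols('y ~ x', data=data_model).fit()
  res = sm.stats.anova_lm(model, typ=1)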

But what I would like to do is this:

smf.ols(df['imp'] ~ df[cat_feature],data=df).fit()

but this isn't valid syntax. I want to avoid building data_model on every iteration, given that one of its columns is always the same.

Is it possible?

Or, simply put:

model = smf.ols('A~B', data=df).fit()

works but

model2 = smf.ols(df.A ~ df.B, data=df).fit()

doesn't.

Josef · Accepted Answer

The formula interface (lowercase ols, in contrast to uppercase OLS) needs a formula string as its first argument.

So I think you want string concatenation:

smf.ols('imp ~' + cat_feature, data=df).fit()

Passing pandas Series, DataFrames, or numpy arrays directly only works with the main class, OLS:

import statsmodels.api as sm
model2 = sm.OLS(df['imp'], df[cat_feature]).fit()

As background information:

OLS is the actual model class.
ols from formula.api is just a convenient alias for the class method OLS.from_formula, which preprocesses the formula information before creating an OLS instance.

The character ~ is a required element of the formula string, but in regular Python code it is only a unary operator, so it cannot be used to separate two arguments or operands when calling classes, methods, or functions. That is why df.A ~ df.B is a syntax error.
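To see the point about ~ without statsmodels at all, this short sketch asks Python to compile the expression from the question. It only uses the built-in compile function; the variable names are arbitrary.

```python
# Python knows ~ only as a unary operator (bitwise NOT),
# so it cannot sit between two operands.
try:
    compile("df.A ~ df.B", "<formula>", "eval")
    binary_tilde_ok = True
except SyntaxError:
    binary_tilde_ok = False

# Unary use is fine: bitwise NOT of 5 is -6.
unary_result = ~5

# Inside a string, ~ is just a character, which is what the formula API relies on.
formula = "A ~ B"
```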

One crucial distinction between the array/dataframe and the formula interface:

The array interface, i.e. using OLS as in
sm.OLS(df['imp'], df[cat_feature])
does not do any preprocessing of the data, i.e. exog is taken as is. Specifically, no constant is added and categorical features are not encoded in some numerical dummy or contrast representation.

The formula interface uses patsy, which preprocesses the data largely the same way R's formulas do. This means that a constant is added by default and any non-numeric columns, such as those that contain strings, are processed as categorical or factor variables.
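The effect of that preprocessing shows up in the parameter names. This is a small sketch with an invented string column "color": the fitted model gets an Intercept term and one dummy per non-reference level, without any manual encoding.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: one string-valued feature and a numeric response.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=60),
    "imp": rng.normal(size=60),
})

# The string column is treated as a factor and an intercept is added automatically.
model = smf.ols("imp ~ color", data=df).fit()
formula_param_names = list(model.params.index)

# Appending "- 1" to the formula removes the default intercept.
model_no_const = smf.ols("imp ~ color - 1", data=df).fit()
no_const_names = list(model_no_const.params.index)
```

With three levels in "color", the first model has an intercept plus two dummy columns, while the second has one column per level and no intercept.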

Tags: python, syntax, statsmodels, anova

julian joseph
