
Statsmodels: ols writing Formula with unknown column names

I am trying to run ANOVA using statsmodels. To do so, I fit a model for every column (categorical feature) in my dataframe against one column, 'imp', in a loop as follows.

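import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

for cat_feature in df:
  # Copy the two columns into a frame with fixed names so the formula can stay 'y ~ x'
  data_model = pd.DataFrame({
    'x': df[cat_feature],
    'y': df['imp']})
  model = smf.ols('y ~ x', data=data_model).fit()
  res = sm.stats.anova_lm(model, typ=1)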

But what I would like to do is this:

smf.ols(df['imp'] ~ df[cat_feature],data=df).fit()

but this isn't valid syntax. I want to avoid building data_model on every iteration when one of its columns is always the same.

Is it possible?

Or, simply put,

model = smf.ols('A~B', data=df).fit()

works but

model2 = smf.ols(df.A ~ df.B, data=df).fit()

doesn't.

Asked by julian joseph, Nov 19 '25 20:11


1 Answer

The formula interface (lower case ols, in contrast to upper case OLS) needs a formula string as its first argument.

So, I think you want string concatenation

smf.ols('imp ~' + cat_feature, data=df).fit()

Passing pandas Series, DataFrames, or numpy arrays directly only works with the main class OLS:

import statsmodels.api as sm
model2 = sm.OLS(df['imp'], df[cat_feature]).fit()

As background information:

OLS is the actual model class
ols from formula.api is just a convenient alias for the method OLS.from_formula that preprocesses the formula information before creating an OLS instance.

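The character ~ separates the response from the predictors inside the formula string and only has that meaning there. In regular Python, ~ is a unary operator (bitwise inversion), so it cannot sit between two arguments, and an expression like df.A ~ df.B is a syntax error.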

One crucial distinction between the array/dataframe and the formula interface:

The array interface, i.e. using OLS as in
sm.OLS(df['imp'], df[cat_feature])
does not do any preprocessing of the data, i.e. exog is taken as is. Specifically, no constant is added and categorical features are not encoded in some numerical dummy or contrast representation.

The formula interface uses patsy to preprocess the data, in large part identically to R's formulas. This means that a constant is added by default and any non-numeric columns, such as those containing strings, are processed as categorical (factor) variables.
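The effect of this preprocessing can be seen in the parameter names of a fitted model. The sketch below again assumes made-up data with a hypothetical string column "color". The formula adds an Intercept term and dummy-codes the strings, and "- 1" in the formula removes the constant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "imp": rng.normal(size=60),
    "color": rng.choice(["red", "green", "blue"], size=60),
})

# The string column is treated as a factor and a constant is added by default.
with_const = smf.ols("imp ~ color", data=df).fit()
print(with_const.params.index.tolist())

# "- 1" drops the intercept, so every level gets its own coefficient.
no_const = smf.ols("imp ~ color - 1", data=df).fit()
print(no_const.params.index.tolist())
```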

Answered by Josef, Nov 21 '25 09:11


