
statsmodels linear regression - patsy formula to include all predictors in model

Say I have a dataframe (let's call it DF) where y is the dependent variable and x1, x2, x3 are my independent variables. In R I can fit a linear model using the following code, and the . will include all of my independent variables in the model:

# R code for fitting linear model
result = lm(y ~ ., data=DF)

I can't figure out how to do this with statsmodels using patsy formulas without explicitly adding all of my independent variables to the formula. Does patsy have an equivalent to R's .? I haven't had any luck finding it in the patsy documentation.

Greg asked Mar 13 '14 19:03



3 Answers

I haven't found an equivalent of . in the patsy documentation either. But what patsy lacks in conciseness, Python makes up for with its string manipulation. You can build a formula involving all predictor columns in DF using

# Index.drop removes "y" and keeps the remaining columns in their original order
all_columns = "+".join(DF.columns.drop("y"))

This gives x1+x2+x3 in your case. Finally, you can build the formula string by prepending y and pass it to a fitting procedure, for example statsmodels' ols:

import statsmodels.formula.api as smf

my_formula = "y~" + all_columns
result = smf.ols(formula=my_formula, data=DF).fit()
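
The approach above can be sketched end to end as follows. This is a minimal example under stated assumptions: the DataFrame is toy data invented for illustration, and the model is fitted with `ols` from `statsmodels.formula.api`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data, invented purely for illustration.
rng = np.random.default_rng(0)
DF = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
DF["y"] = 1.0 + 2.0 * DF["x1"] - DF["x3"] + rng.normal(scale=0.1, size=50)

# Every column except the response becomes a predictor.
predictors = DF.columns.drop("y")
my_formula = "y ~ " + " + ".join(predictors)

# Fit ordinary least squares; patsy adds the intercept automatically.
result = smf.ols(formula=my_formula, data=DF).fit()
print(my_formula)
print(result.params)
```

Because the formula is an ordinary string, the same pattern should carry over to other formula-based statsmodels models.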
Sudeep Juvekar answered Sep 19 '22 12:09


No, this doesn't exist in patsy yet, unfortunately. See this issue.

jseabold answered Sep 18 '22 12:09


As this is still not included in patsy, I wrote a small function that I call when I need to run statsmodels models with all columns (optionally with exceptions):

def ols_formula(df, dependent_var, *excluded_cols):
    '''
    Generates the R style formula for statsmodels (patsy) given
    the dataframe, dependent variable and optional excluded columns
    as strings
    '''
    df_columns = list(df.columns.values)
    df_columns.remove(dependent_var)
    for col in excluded_cols:
        df_columns.remove(col)
    return dependent_var + ' ~ ' + ' + '.join(df_columns)

For example, for a dataframe called df with columns y, x1, x2, x3, running ols_formula(df, 'y', 'x3') returns 'y ~ x1 + x2'.
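
One caveat when joining column names directly: patsy parses each name as Python code, so a column called `weight (kg)` or `2nd_try` would not parse as a variable. A possible variant, sketched below, wraps such names in patsy's `Q()` quoting helper. The function name `safe_ols_formula` is hypothetical and not part of patsy or statsmodels, and it takes the column names as an iterable of strings rather than the dataframe itself.

```python
import keyword


def safe_ols_formula(columns, dependent_var, *excluded_cols):
    """Build a patsy formula string, quoting awkward column names with Q().

    `columns` is any iterable of column-name strings, e.g. df.columns.
    Illustrative sketch only; not part of patsy or statsmodels.
    """
    def term(name):
        # Plain identifiers can appear in a formula as-is; anything else
        # (spaces, dots, leading digits, Python keywords) is wrapped in Q("...").
        if name.isidentifier() and not keyword.iskeyword(name):
            return name
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        return 'Q("{}")'.format(escaped)

    skip = {dependent_var, *excluded_cols}
    predictors = [term(c) for c in columns if c not in skip]
    return term(dependent_var) + " ~ " + " + ".join(predictors)
```

For columns y, x1 and weight (kg), calling safe_ols_formula(df.columns, 'y') gives 'y ~ x1 + Q("weight (kg)")'. Names that are valid identifiers pass through unchanged, so the output matches ols_formula in the simple case.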

emredjan answered Sep 18 '22 12:09