Apply a function to all possible combinations of columns in a DataFrame in Python: a better way

What I am trying to do is run a linear regression with statsmodels.api on every possible pairwise combination of columns in a DataFrame.

I was able to do it with the following code, for the DataFrame df:

import statsmodels.api as sm
import numpy as np
import pandas as pd

#generate example Dataframe
df = pd.DataFrame(abs(np.random.randn(50, 4)*10), columns=list('ABCD'))

#extract all possible combinations of columns by column index number
i, j = np.tril_indices(df.shape[1], -1)

#loop over the pairs, creating a numbered variable and running the regression for each pairwise combination
for idx,item in enumerate(list(zip(i, j))):
    exec("model" + str(idx) +" = sm.OLS(df.iloc[:,"+str(item[0])+"],df.iloc[:,"+str(item[1])+"])")
    exec("regre_result" + str(idx) +" = model" + str(idx)+".fit()")

regre_result0.summary()

                        OLS Regression Results
Dep. Variable:     B                 R-squared:           0.418
Model:             OLS               Adj. R-squared:      0.406
Method:            Least Squares     F-statistic:         35.17
Date:              Tue, 09 Jan 2018  Prob (F-statistic):  3.00e-07
Time:              14:16:25          Log-Likelihood:      -174.29
No. Observations:  50                AIC:                 350.6
Df Residuals:      49                BIC:                 352.5
Df Model:          1
Covariance Type:   nonrobust

     coef    std err  t      P>|t|  [0.025  0.975]
A    0.7189  0.121    5.930  0.000  0.475   0.962

Omnibus:           14.290            Durbin-Watson:       1.828
Prob(Omnibus):     0.001             Jarque-Bera (JB):    16.289
Skew:              1.101             Prob(JB):            0.000290
Kurtosis:          4.722             Cond. No.            1.00

It works, but I imagine there is an easier way to get similar results. Can anybody point me to the best way to do it?

asked Jan 09 '18 13:01 by RiskTech



1 Answer

Why are you doing it this way with exec and loads of variables instead of just appending to a list?

You can also use itertools.combinations to get all pairs of columns.

Try something like this:

In [1]: import itertools
In [2]: import pandas as pd
In [3]: daf = pd.DataFrame(columns=list('ABCD'))
In [4]: list(itertools.combinations(daf.columns, 2))
Out[4]: [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
In [5]: col_pairs = list(itertools.combinations(daf.columns, 2))
In [6]: models = []
In [7]: results = []
In [8]: for a, b in col_pairs:
   ...:     model = get_model(df[a], df[b])  # df is the DataFrame from the question
   ...:     models.append(model)
   ...:     result = get_result(model)
   ...:     results.append(result)
In [9]: results[0].summary()

Here get_model would call sm.OLS and get_result would call fit (or just call those directly in the loop instead of wrapping them in separate functions). But don't do it the exec way: best practice is to avoid exec.
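As a rough, self-contained sketch of the same idea, the pairwise loop can be wrapped in a small helper that stores each result in a dict keyed by column names rather than in numbered variables. The names apply_pairwise and slope_through_origin are hypothetical, the seeded data only imitates the question's random DataFrame, and the slope function is a plain pandas stand-in used so the snippet runs without statsmodels.

```python
import itertools

import numpy as np
import pandas as pd


def apply_pairwise(df, func):
    """Call func(df[a], df[b]) for every unordered pair of columns (a, b).

    Returns a dict keyed by the (a, b) column-name tuple, so each result
    can be looked up by name instead of through a numbered variable.
    """
    return {
        (a, b): func(df[a], df[b])
        for a, b in itertools.combinations(df.columns, 2)
    }


def slope_through_origin(x, y):
    """Least-squares slope of y on x with no intercept (illustrative stand-in)."""
    return float((x * y).sum() / (x * x).sum())


# Example data shaped like the question's DataFrame (seeded; an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(np.abs(rng.standard_normal((50, 4)) * 10), columns=list("ABCD"))

slopes = apply_pairwise(df, slope_through_origin)
print(list(slopes))        # the six column-name pairs
print(slopes[("A", "B")])  # slope of B regressed on A, no intercept
```

With statsmodels imported as sm, as in the question, passing lambda x, y: sm.OLS(y, x).fit() as func should give the fitted regressions, and results[("A", "B")].summary() would then correspond to the question's regre_result0 (B regressed on A). Note that itertools.combinations yields the earlier column first, so the argument order decides which column is the dependent variable.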

answered Sep 24 '22 19:09 by LangeHaare