What I am trying to do is apply a linear regression, using statsmodels.api, to every possible pairwise combination of columns in a DataFrame.
I was able to do it with the following code, for the DataFrame df:
import statsmodels.api as sm
import numpy as np
import pandas as pd
#generate example Dataframe
df = pd.DataFrame(abs(np.random.randn(50, 4)*10), columns=list('ABCD'))
#extract all possible combinations of columns by column index number
i, j = np.tril_indices(df.shape[1], -1)
#generate a for loop that creates the variables and runs the regression on each pairwise combination
for idx,item in enumerate(list(zip(i, j))):
    exec("model" + str(idx) +" = sm.OLS(df.iloc[:,"+str(item[0])+"],df.iloc[:,"+str(item[1])+"])")
    exec("regre_result" + str(idx) +" = model" + str(idx)+".fit()")
regre_result0.summary()
                            OLS Regression Results
Dep. Variable:                      B   R-squared:                       0.418
Model:                            OLS   Adj. R-squared:                  0.406
Method:                 Least Squares   F-statistic:                     35.17
Date:                Tue, 09 Jan 2018   Prob (F-statistic):           3.00e-07
Time:                        14:16:25   Log-Likelihood:                -174.29
No. Observations:                  50   AIC:                             350.6
Df Residuals:                      49   BIC:                             352.5
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
A              0.7189      0.121      5.930      0.000       0.475       0.962

Omnibus:                       14.290   Durbin-Watson:                   1.828
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               16.289
Skew:                           1.101   Prob(JB):                     0.000290
Kurtosis:                       4.722   Cond. No.                         1.00
It works, but I imagine there is an easier way to achieve similar results. Can anybody point me to the best way to do it?
To apply a function that takes as input multiple column values, use the DataFrame's apply(~) method.
While slower than apply, itertuples is quicker than iterrows, so if looping is required, try itertuples instead. Using map as a vectorized solution gives even faster results.
apply is not faster in itself, but it has advantages when used in combination with DataFrames, depending on the content of the apply expression. If the expression can be executed in Cython space, apply is much faster (which is the case here).
DataFrame - apply() function. The apply() function is used to apply a function along an axis of the DataFrame. Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
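As a small illustration of the two axis settings described above, here is a minimal sketch. The toy DataFrame and the lambda functions are made up for this example and are not part of the question.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the axis argument
toy = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# axis=0: each column is passed to the function as a Series
# whose index is the DataFrame's index
col_ranges = toy.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: each row is passed to the function as a Series
# whose index is the DataFrame's columns
row_sums = toy.apply(lambda row: row["A"] + row["B"], axis=1)

print(col_ranges)
print(row_sums)
```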
Why are you doing it this way with exec and loads of variables instead of just appending to a list?
You can also use itertools.combinations to get all pairs of columns.
Try something like this:
In [1]: import itertools
In [2]: import pandas as pd
In [3]: daf = pd.DataFrame(columns=list('ABCD'))
In [4]: list(itertools.combinations(daf.columns, 2))
Out[4]: [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
In [5]: col_pairs = list(itertools.combinations(daf.columns, 2))
In [6]: models = []
In [7]: results = []
In [8]: for a,b in col_pairs:
   ...:     model = get_model(df[a],df[b])
   ...:     models.append(model)
   ...:     result = get_result(model)
   ...:     results.append(result)
In [9]: results[0].summary()
Here get_model will call sm.OLS and get_result will call fit (or just call those directly in the loop without putting them in external functions). But don't do it this crazy exec way; best practice is to avoid using exec.