 

Scikit Logistic Regression summary output?

Is there a way to get a similar, nice summary output for scikit-learn logistic regression models as in statsmodels, with all the p-values, std. errors etc. in one table?

asked May 21 '16 by TheDude



1 Answer

As you and others have pointed out, this is a limitation of scikit-learn. Before getting to a scikit-learn approach for your question below, the "best" option is to use statsmodels as follows:

import statsmodels.api as sm

smlog = sm.Logit(y, sm.add_constant(X)).fit()  #add_constant appends an intercept column to X
smlog.summary()  #coefficient table with std. errors, z-scores and p-values

Here X is your matrix of input features/predictors and y is the outcome variable. Statsmodels works well provided X has no highly correlated features, has no low-variance features, no feature produces "perfect/quasi-perfect separation", and any categorical features are reduced to n-1 levels, i.e. dummy-coded (and not n levels, i.e. one-hot encoded, as described here: dummy variable trap). A minimal sketch of that dummy-coding step follows below.
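For the dummy-coding step, here is a minimal sketch using a hypothetical frame X_demo with a 3-level categorical column "city" (the names are illustrative, not from the question); pd.get_dummies with drop_first=True keeps n-1 levels:

import pandas as pd

#hypothetical frame with a 3-level categorical feature "city"
X_demo = pd.DataFrame({"age": [23, 45, 31], "city": ["NY", "LA", "SF"]})

#drop_first=True keeps n-1 dummy columns, avoiding the dummy variable trap
X_demo = pd.get_dummies(X_demo, columns=["city"], drop_first=True)
print(X_demo.columns)  #Index(['age', 'city_NY', 'city_SF'], dtype='object')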

However, if the above isn't feasible/practical, one scikit-learn approach is coded below and gives fairly equivalent results in terms of feature coefficients/odds with their standard errors and 95% CI estimates. Essentially, the code generates these results from distinct scikit-learn logistic regression models trained on distinct test-train splits of your data. Again, make sure categorical features are dummy-coded to n-1 levels (or your scikit-learn coefficients will be incorrect for categorical features).

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

#Instantiate logistic regression model with regularization turned OFF
#(use penalty="none" on scikit-learn < 1.2)
log_nr = LogisticRegression(fit_intercept=True, penalty=None)

##Generate 5 distinct random numbers - as random seeds for 5 test-train splits
import random
randomlist = random.sample(range(1, 10000), 5)

##Create features column 
coeff_table = pd.DataFrame(X.columns, columns=["features"])

##Assemble coefficients from logistic regression models on 5 random data splits
#iterate over random states while keeping track of `i`
from sklearn.model_selection import train_test_split
for i, state in enumerate(randomlist):
    #one test-train split per random seed, stratified on the outcome
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)
    log_nr.fit(train_x, train_y)  #fit logistic model
    #log_nr.coef_ has shape (1, n_features); transpose it to a column
    coeff_table[f"coefficients_{i+1}"] = np.transpose(log_nr.coef_)

##Calculate mean and std error of model coefficients (over the 5 models above)
#columns 1:6 hold the five coefficient columns; column 0 holds feature names
coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)

#Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96 * coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]

Finally, (optionally) convert the coefficients to odds ratios by exponentiating them as follows. Odds ratios are my favorite output from logistic regression, and these are appended to your dataframe using the code below.

#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])

This answer builds upon a somewhat related reply by @pciunkiewicz, available here: Collate model coefficients across multiple test-train splits from sklearn

answered Nov 15 '22 by veg2020