 

Scikit Logistic Regression summary output?

Is there a way to get a similar, nice summary output for scikit-learn logistic regression models as in statsmodels, with all the p-values, std. errors etc. in one table?

asked May 21 '16 by TheDude



1 Answer

As you and others have pointed out, this is a limitation of scikit-learn. Before getting to a scikit-learn approach for your question below, the "best" option is to use statsmodels as follows:

import statsmodels.api as sm

smlog = sm.Logit(y, sm.add_constant(X)).fit()  #add_constant appends an intercept column to X
smlog.summary()  #coefficient table with std. errors, z-scores and p-values

Here X is your matrix of input features/predictors and y is the outcome variable. Statsmodels works well provided X has no highly correlated features, has no low-variance features, no feature produces "perfect/quasi-perfect separation", and any categorical features are reduced to n-1 levels, i.e. dummy-coded (and not n levels, i.e. one-hot encoded, as described here: dummy variable trap). A minimal sketch of that dummy-coding step follows below.
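For the dummy-coding step, here is a minimal sketch using a hypothetical frame X_demo with a 3-level categorical column "city" (the names are illustrative, not from the question); pd.get_dummies with drop_first=True keeps n-1 levels:

import pandas as pd

#hypothetical frame with a 3-level categorical feature "city"
X_demo = pd.DataFrame({"age": [23, 45, 31], "city": ["NY", "LA", "SF"]})

#drop_first=True keeps n-1 dummy columns, avoiding the dummy variable trap
X_demo = pd.get_dummies(X_demo, columns=["city"], drop_first=True)
print(X_demo.columns)  #Index(['age', 'city_NY', 'city_SF'], dtype='object')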

However, if the above isn't feasible/practical, one scikit-learn approach is coded below and gives fairly equivalent results in terms of feature coefficients/odds with their standard errors and 95% CI estimates. Essentially, the code generates these results from distinct scikit-learn logistic regression models trained on distinct test-train splits of your data. Again, make sure categorical features are dummy-coded to n-1 levels (or your scikit-learn coefficients will be incorrect for categorical features).

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

#Instantiate logistic regression model with regularization turned OFF
#(use penalty="none" on scikit-learn < 1.2)
log_nr = LogisticRegression(fit_intercept=True, penalty=None)

##Generate 5 distinct random numbers - as random seeds for 5 test-train splits
import random
randomlist = random.sample(range(1, 10000), 5)

##Create features column 
coeff_table = pd.DataFrame(X.columns, columns=["features"])

##Assemble coefficients from logistic regression models on 5 random data splits
#iterate over random states while keeping track of `i`
from sklearn.model_selection import train_test_split
for i, state in enumerate(randomlist):
    #one test-train split per random seed, stratified on the outcome
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)
    log_nr.fit(train_x, train_y)  #fit logistic model
    #log_nr.coef_ has shape (1, n_features); transpose it to a column
    coeff_table[f"coefficients_{i+1}"] = np.transpose(log_nr.coef_)

##Calculate mean and std error of model coefficients (over the 5 models above)
#columns 1:6 hold the five coefficient columns; column 0 holds feature names
coeff_table["mean_coeff"] = coeff_table.iloc[:, 1:6].mean(axis=1)
coeff_table["se_coeff"] = coeff_table.iloc[:, 1:6].sem(axis=1)

#Calculate 95% CI intervals for feature coefficients
coeff_table["95ci_se_coeff"] = 1.96 * coeff_table["se_coeff"]
coeff_table["coeff_95ci_LL"] = coeff_table["mean_coeff"] - coeff_table["95ci_se_coeff"]
coeff_table["coeff_95ci_UL"] = coeff_table["mean_coeff"] + coeff_table["95ci_se_coeff"]

Finally, (optionally) convert the coefficients to odds ratios by exponentiating them as follows. Odds ratios are my favorite output from logistic regression, and these are appended to your dataframe using the code below.

#Calculate odds ratios and 95% CI (LL = lower limit, UL = upper limit) intervals for each feature
coeff_table["odds_mean"] = np.exp(coeff_table["mean_coeff"])
coeff_table["95ci_odds_LL"] = np.exp(coeff_table["coeff_95ci_LL"])
coeff_table["95ci_odds_UL"] = np.exp(coeff_table["coeff_95ci_UL"])

This answer builds upon a somewhat related reply by @pciunkiewicz, available here: Collate model coefficients across multiple test-train splits from sklearn

answered Nov 15 '22 by veg2020