I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:
    from sklearn import metrics
    fpr, tpr, thresholds = metrics.roc_curve(Y_test, p)
I know metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me what command will find the optimal cut-off point (threshold value)?
The closer a curve comes to the upper-left corner of the unit square, the better the test. No surprise, then, that another common criterion for choosing the most appropriate cut-off value is to select the point on the ROC curve with the minimum distance from that corner (8, 15, 16).
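A minimal sketch of that criterion (assuming Y_test and p hold the true labels and predicted probabilities, as in the question): compute the distance of each ROC point to the corner (0, 1) and pick the threshold that minimizes it.

    import numpy as np
    from sklearn.metrics import roc_curve

    # Y_test: true 0/1 labels, p: predicted probabilities (names from the question)
    fpr, tpr, thresholds = roc_curve(Y_test, p)

    # Euclidean distance from each ROC point (fpr, tpr) to the upper-left corner (0, 1)
    distances = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)

    # threshold whose ROC point lies closest to the corner
    optimal_threshold = thresholds[np.argmin(distances)]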
A really easy way to pick a threshold is to take the median of the predicted values of the positive cases in a test set. This becomes your threshold. It comes reasonably close to the threshold you would get from the ROC curve at the point where the true positive rate (tpr) and 1 - false positive rate (fpr) cross.
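A rough sketch of this heuristic (again assuming Y_test and p are NumPy arrays of true labels and predicted probabilities):

    import numpy as np

    # median predicted probability among the actual positive cases
    threshold = np.median(p[Y_test == 1])

    # apply it to get class labels
    y_pred = (p >= threshold).astype(int)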
ROC Curves and AUC in Python

The AUC for the ROC can be calculated using the roc_auc_score() function. Like the roc_curve() function, the AUC function takes both the true outcomes (0, 1) from the test set and the predicted probabilities for the 1 class.
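For instance (a small sketch; y_test and probs are placeholder names for the true labels and the predicted probabilities of class 1):

    from sklearn.metrics import roc_auc_score

    # true 0/1 outcomes and predicted probabilities for the positive class
    auc_score = roc_auc_score(y_test, probs)
    print('AUC: %.3f' % auc_score)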
You can do this using the epi package in R; however, I could not find a similar package or example in Python.
The optimal cut-off point would be where the true positive rate is high and the false positive rate is low. Based on this logic, I have pulled together the example below to find the optimal threshold.
    import pandas as pd
    import statsmodels.api as sm
    import pylab as pl
    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # read the data in
    df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

    # rename the 'rank' column because there is also a DataFrame method called 'rank'
    df.columns = ["admit", "gre", "gpa", "prestige"]

    # dummify rank
    dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')

    # create a clean data frame for the regression
    cols_to_keep = ['admit', 'gre', 'gpa']
    data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])

    # manually add the intercept
    data['intercept'] = 1.0
    train_cols = data.columns[1:]

    # fit the model
    result = sm.Logit(data['admit'], data[train_cols]).fit()
    print(result.summary())

    # add prediction to dataframe
    data['pred'] = result.predict(data[train_cols])

    fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
    roc_auc = auc(fpr, tpr)
    print("Area under the ROC curve : %f" % roc_auc)

    ####################################
    # The optimal cut-off would be where tpr is high and fpr is low;
    # tpr - (1 - fpr) is zero or near zero at the optimal cut-off point.
    ####################################
    i = np.arange(len(tpr))  # index for the DataFrame
    roc = pd.DataFrame({'fpr': pd.Series(fpr, index=i),
                        'tpr': pd.Series(tpr, index=i),
                        '1-fpr': pd.Series(1 - fpr, index=i),
                        'tf': pd.Series(tpr - (1 - fpr), index=i),
                        'thresholds': pd.Series(thresholds, index=i)})
    roc.iloc[(roc.tf - 0).abs().argsort()[:1]]

    # plot tpr vs 1-fpr
    fig, ax = pl.subplots()
    pl.plot(roc['tpr'])
    pl.plot(roc['1-fpr'], color='red')
    pl.xlabel('1 - False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('Receiver operating characteristic')
    ax.set_xticklabels([])
The optimal cut-off point is 0.317628, so anything above this can be labeled 1, else 0. You can see from the output/chart that where TPR crosses 1-FPR, the TPR is 63%, the FPR is 36%, and TPR - (1-FPR) is nearest to zero in the current example.
            1-fpr       fpr        tf  thresholds       tpr
    171  0.637363  0.362637  0.000433    0.317628  0.637795
Hope this is helpful.
To simplify things and make this reusable, I have made a function to find the optimal probability cutoff point.
    def Find_Optimal_Cutoff(target, predicted):
        """ Find the optimal probability cutoff point for a classification model related to the event rate

        Parameters
        ----------
        target : Matrix with dependent or target data, where rows are observations
        predicted : Matrix with predicted data, where rows are observations

        Returns
        -------
        list type, with optimal cutoff value
        """
        fpr, tpr, threshold = roc_curve(target, predicted)
        i = np.arange(len(tpr))
        roc = pd.DataFrame({'tf': pd.Series(tpr - (1 - fpr), index=i),
                            'threshold': pd.Series(threshold, index=i)})
        roc_t = roc.iloc[(roc.tf - 0).abs().argsort()[:1]]
        return list(roc_t['threshold'])

    # add prediction probability to dataframe
    data['pred_proba'] = result.predict(data[train_cols])

    # find optimal probability threshold
    threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
    print(threshold)
    # [0.31762762459360921]

    # apply the threshold to the dataframe to get class predictions
    data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)

    # print the confusion matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(data['admit'], data['pred'])
    # array([[175,  98],
    #        [ 46,  81]])
Given the tpr, fpr, and thresholds from your question, the optimal threshold is simply the one that maximizes Youden's J statistic, tpr - fpr:
    optimal_idx = np.argmax(tpr - fpr)
    optimal_threshold = thresholds[optimal_idx]