How to set a threshold for a sklearn classifier based on ROC results?

I trained an ExtraTreesClassifier (Gini criterion) using scikit-learn, and it suits my needs fairly well. The accuracy is not great, but with 10-fold cross-validation the AUC is 0.95. I would like to use this classifier in my work. I am quite new to ML, so please forgive me if I'm asking something conceptually wrong.

I plotted some ROC curves, and from them it seems there is a specific threshold where my classifier starts performing well. I'd like to set this value on the fitted classifier, so that every time I call predict, the classifier uses that threshold and I can trust the FP and TP rates.

I also came across this post (scikit .predict() default threshold), where it is stated that a threshold is not a generic concept for classifiers. But since ExtraTreesClassifier has the method predict_proba, and the ROC curve is itself defined in terms of thresholds, it seems to me I should be able to specify one.

I did not find any parameter, class, or interface for doing this. How can I set a threshold on a trained ExtraTreesClassifier (or any other classifier) using scikit-learn?

Many Thanks, Colis

263

asked Jan 26 '17 00:01

Colis

1 Answer

This is what I have done:

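from sklearn.metrics import roc_curve

model = SomeSklearnModel()  # placeholder for any classifier with predict_proba
model.fit(X_train, y_train)
predict = model.predict(X_test)
predict_probabilities = model.predict_proba(X_test)
# roc_curve needs a 1-D score: column 1 is the positive-class probability
fpr, tpr, thresholds = roc_curve(y_test, predict_probabilities[:, 1])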

However, I am annoyed that predict chooses a threshold that yields only 0.4% true positives (with zero false positives). The ROC curve shows a point I like better for my problem, where the true positive rate is approximately 20% (false positive rate around 4%). I then scan predict_probabilities to find which probability value corresponds to my favourite ROC point. In my case this probability is 0.21. Then I create my own predict array:
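The scan for a probability matching a chosen ROC point can also be done programmatically, since roc_curve returns the thresholds alongside fpr and tpr. Below is a minimal sketch, assuming the selection rule is "highest true positive rate with the false positive rate under a cap". The helper name threshold_for_max_fpr and the small arrays are made up for illustration; they only echo the numbers quoted above and are not real results.

```python
import numpy as np

def threshold_for_max_fpr(fpr, tpr, thresholds, max_fpr):
    """Return (threshold, fpr, tpr) for the ROC point with the highest TPR
    among the points whose FPR does not exceed max_fpr."""
    fpr, tpr, thresholds = map(np.asarray, (fpr, tpr, thresholds))
    allowed = np.flatnonzero(fpr <= max_fpr)   # indices of acceptable ROC points
    best = allowed[np.argmax(tpr[allowed])]    # best TPR among those points
    return thresholds[best], fpr[best], tpr[best]

# Made-up arrays in the same layout that roc_curve returns (illustration only)
fpr = np.array([0.00, 0.00, 0.04, 0.10, 1.00])
tpr = np.array([0.00, 0.004, 0.20, 0.35, 1.00])
thresholds = np.array([np.inf, 0.50, 0.21, 0.10, 0.00])

chosen, chosen_fpr, chosen_tpr = threshold_for_max_fpr(
    fpr, tpr, thresholds, max_fpr=0.05
)
print(chosen, chosen_fpr, chosen_tpr)
```

Other selection rules (for example, maximising tpr - fpr) fit the same pattern; only the line that computes best would change.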

import numpy as np
predict_mine = np.where(predict_probabilities[:, 1] > 0.21, 1, 0)

and there you go:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predict_mine)

returns what I wanted:

array([[6927,  309],
       [ 621,  121]])

114

answered Oct 04 '22 12:10

famargar

Tags: python, classification, scikit-learn, roc, threshold
