 

Controlling the threshold in Logistic Regression in Scikit Learn

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even set the class_weight parameter to auto.

I know that in logistic regression it should be possible to know the threshold value for a particular pair of classes.

Is it possible to know what the threshold value is in each of the One-vs-All classes that the LogisticRegression() method builds?

I did not find anything on the documentation page.

Does it by default apply 0.5 as the threshold for all classes, regardless of the parameter values?

asked Feb 25 '15 by London guy


People also ask

How do you choose the threshold in logistic regression?

Logistic regression assigns each row a probability of being True and then makes a prediction for each row where that probability is >= 0.5, i.e. 0.5 is the default threshold.
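For the binary case in scikit-learn this is easy to check yourself. Below is a minimal sketch (the toy data and the clf name are made up for illustration) showing that predict() matches thresholding the positive-class probability at 0.5:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem with 0/1 labels
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]        # P(y=1) for each row
manual = (proba > 0.5).astype(int)        # apply the default 0.5 threshold by hand
print(np.array_equal(manual, clf.predict(X)))   # True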

What is the effect of threshold on the logistic regression model?

If you set the threshold to 1, the classifier will always predict 0, which makes its accuracy p(y=0); if you set the threshold to 0, it will always predict 1 and have accuracy p(y=1); in between, the accuracy takes various values.
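A quick sketch of that claim with made-up numbers (nothing here comes from the question):

import numpy as np

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1])
proba = np.array([0.2, 0.4, 0.6, 0.7, 0.9])

always_zero = (proba > 1.0).astype(int)   # threshold 1 -> always predict 0
print((always_zero == y_true).mean())     # 0.6, i.e. p(y=0)

always_one = (proba > 0.0).astype(int)    # threshold 0 -> always predict 1
print((always_one == y_true).mean())      # 0.4, i.e. p(y=1)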


Why is thresholding important in logistic regression?

Thresholding. We can convert the probabilities to predictions using what's called a threshold value, t. If the probability of poor care is greater than this threshold value, t, we predict poor quality care. But if the probability of poor care is less than the threshold value, t, then we predict good quality care.
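As a tiny illustration (the probabilities and threshold below are invented):

import numpy as np

# Hypothetical predicted probabilities of poor care and a chosen threshold t
poor_care_proba = np.array([0.10, 0.45, 0.62, 0.80])
t = 0.5
prediction = (poor_care_proba > t).astype(int)   # 1 = poor care, 0 = good care
print(prediction)                                # [0 0 1 1]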


2 Answers

There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data). Then try a range of threshold values and analyze their effect on the predictions:

import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Class probabilities instead of hard 0/1 predictions
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))

threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]

for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # Predict class 1 whenever its probability exceeds the threshold i
    y_test_pred = (pred_proba_df.iloc[:, 1] > i).astype(int)
    test_accuracy = metrics.accuracy_score(Y_test, y_test_pred)
    print('Our testing accuracy is {}'.format(test_accuracy))
    print(confusion_matrix(Y_test, y_test_pred))

Best!

answered Sep 22 '22 by jazib jamil


Logistic regression chooses the class that has the biggest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).
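For what it's worth, this is easy to verify in scikit-learn; here is a rough sketch on a toy 3-class problem (the iris data and variable names are just for illustration, not from the answer):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)                          # shape (n_samples, 3)
argmax_pred = clf.classes_[np.argmax(proba, axis=1)]  # class with the biggest probability
print(np.array_equal(argmax_pred, clf.predict(X)))    # True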

Introducing custom thresholds only affects the proportion of false positives to false negatives (and thus the precision/recall tradeoff), but it is not a parameter of the LR model. See also the similar question.
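To see the tradeoff concretely, here is a hedged sketch using precision_recall_curve on an imbalanced toy set (everything below is an assumption for illustration; the model itself never changes, only the threshold):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Precision and recall at each candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, proba)
for p, r, t in list(zip(precision, recall, thresholds))[::50]:
    print('threshold={:.2f}  precision={:.2f}  recall={:.2f}'.format(t, p, r))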

answered Sep 21 '22 by Nikita Astrakhantsev