I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set, and I have even set the class_weight parameter to 'auto'.

I know that in logistic regression it should be possible to know the threshold value for a particular pair of classes. Is it possible to know what the threshold value is for each of the one-vs-all classifiers that the LogisticRegression() method builds? I did not find anything about this in the documentation.

Does it by default apply 0.5 as the threshold for all classes, regardless of the parameter values?
Logistic regression assigns each row a probability of being True and then makes a prediction for each row where that probability is >= 0.5, i.e. 0.5 is the default threshold.
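As a minimal sketch of that behaviour, on made-up data from make_classification (and using class_weight='balanced', the current name for the old 'auto' option), predict() agrees with applying the 0.5 threshold to predict_proba() by hand:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up unbalanced binary data (roughly 90% negatives)
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(class_weight='balanced').fit(X, y)

proba = clf.predict_proba(X)[:, 1]       # P(y = 1) for each row
manual = (proba > 0.5).astype(int)       # apply the 0.5 threshold by hand
print(np.array_equal(manual, clf.predict(X)))  # prints True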
Yes. If you set the threshold to 1, the classifier will always predict 0, which makes its accuracy p(y=0); if you set the threshold to 0, it will always predict 1 and have accuracy p(y=1); in between, the accuracy passes through various values.
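A self-contained sketch of those extreme cases, again on made-up data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

for t in (0.0, 1.0):
    preds = (proba >= t).astype(int)  # t=0: always predict 1; t=1: always predict 0
    print('threshold', t, 'accuracy', (preds == y).mean())
# accuracy at t=0 is the fraction of 1s, p(y=1); at t=1 it is p(y=0)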
Thresholding: we can convert the probabilities to predictions using a threshold value, t. For example, in a healthcare setting, if the probability of poor care is greater than the threshold t, we predict poor quality care; if it is less than t, we predict good quality care.
There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data), and then sweep a range of threshold values to analyze their effect on the predictions:
import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for t in threshold_list:
    print('\n******** For threshold = {} ********'.format(t))
    # Predict class 1 whenever its predicted probability exceeds the threshold
    y_test_pred = (pred_proba_df.iloc[:, 1] > t).astype(int)
    test_accuracy = metrics.accuracy_score(Y_test, y_test_pred)
    print('Our testing accuracy is {}'.format(test_accuracy))
    print(confusion_matrix(Y_test, y_test_pred))
Best!
Logistic regression chooses the class that has the biggest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5, then obviously P(Y=0) > P(Y=1). The same holds in the multiclass setting: again, it chooses the class with the biggest probability (see, e.g., Ng's lectures, the bottom lines).
Introducing special thresholds only affects the proportion of false positives to false negatives (and thus the precision/recall trade-off), but it is not a parameter of the LR model itself. See also the similar question.
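To sketch that point on made-up 3-class data (all names below are illustrative), predict() simply takes the argmax of predict_proba(), so there is no per-class threshold stored in the model to read off:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up 3-class data (n_informative raised so 3 classes fit)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)                       # shape (n_samples, 3); rows sum to 1
argmax_pred = clf.classes_[proba.argmax(axis=1)]   # pick the most probable class
print(np.array_equal(argmax_pred, clf.predict(X)))  # prints True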