 

How to penalize False Negatives more than False Positives

From the business perspective, false negatives lead to about tenfold higher costs (real money) than false positives. Given my standard binary classification models (logit, random forest, etc.), how can I incorporate this into my model?

Do I have to change (weight) the loss function in favor of the 'preferred' error (FP)? If so, how can I do that?

asked Mar 07 '18 by ivegotaquestion


People also ask

How can false negatives and false positives be reduced?

To reduce the number of false negatives (FN) or false positives (FP), one approach is to retrain the model on the same data with its output targets adjusted to penalize the errors observed in the previous run, continuing training until the loss converges (ideally to a global minimum).

How do you reduce the number of false negatives?

Current methods for minimizing false negatives include changing class weights, performing data augmentation to create a deliberately biased dataset, and shifting the decision boundary [2].

Why is a false negative worse than a false positive?

In software testing, a false negative result is worse than a false positive, since the bug stays in the code indefinitely. One technique for catching such cases is mutation testing: using it, test engineers can identify false negatives in their test suites.

Can you cite some examples where a false negative is more important than a false positive?

A patient may be diagnosed with diabetes when they actually do not have the disease. This is a false positive, and it can lead to unnecessary medical treatment. On the other hand, a false negative is when the test shows that a patient does not have diabetes when they actually do, leaving the disease untreated.


1 Answer

There are several options for you:

  • As suggested in the comments, class_weight weights the loss function towards the class whose errors are costlier (here, the positive class, so false negatives are penalized more). This option is supported by various estimators, including sklearn.linear_model.LogisticRegression, sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
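    A minimal sketch of this option (assuming a binary problem where class 1 is the positive, costly-to-miss class, and that x_train / y_train hold your training data):

    from sklearn.linear_model import LogisticRegression

    # Penalize errors on class 1 ten times more than errors on class 0,
    # matching the roughly tenfold cost of a false negative
    classifier = LogisticRegression(class_weight={0: 1, 1: 10})
    classifier.fit(x_train, y_train)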

  • You can also select the decision threshold very low during cross-validation to pick the model that gives the highest recall (though possibly low precision). A recall close to 1.0 effectively means a number of false negatives close to 0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # Out-of-fold decision scores for every training sample
    y_scores = cross_val_predict(classifier, x_train, y_train, cv=3,
                                 method="decision_function")

    precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
    

    If you plot the precisions and recalls against the thresholds, you should see a picture like this:

    [figure: precision/recall tradeoff — precision and recall plotted against the decision threshold]
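    A sketch of how such a plot can be produced with matplotlib (note that precision_recall_curve returns one more precision/recall value than thresholds, hence the [:-1] slicing):

    import matplotlib.pyplot as plt

    # Drop the extra last entry so the arrays align with the thresholds
    plt.plot(thresholds, precisions[:-1], label="precision")
    plt.plot(thresholds, recalls[:-1], label="recall")
    plt.xlabel("decision threshold")
    plt.legend()
    plt.show()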

    After picking the best threshold, you can use the raw scores from the classifier.decision_function() method for your final classification, as in the sketch below.
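    For example, a minimal sketch that picks the largest threshold still meeting a target recall (the 0.95 target is purely illustrative, and x_test is assumed to be your held-out data):

    import numpy as np

    target_recall = 0.95  # illustrative; set this from your FN/FP cost ratio

    # recalls is non-increasing in the threshold, so take the largest
    # threshold that still achieves the target recall
    idx = np.where(recalls[:-1] >= target_recall)[0][-1]
    threshold = thresholds[idx]

    # Refit on the full training set (cross_val_predict fits clones,
    # not the classifier itself), then classify from the raw scores
    classifier.fit(x_train, y_train)
    y_pred = classifier.decision_function(x_test) >= threshold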

Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier, one that always predicts the positive class and thus never produces a false negative, but is useless.

answered Oct 04 '22 by Maxim