I'm trying to measure the performance of a computer vision program that detects objects in video. I have 3 different versions of the program with different parameters. I've benchmarked each of these versions and got 3 pairs of (false positive percentage, false negative percentage).
Now I want to compare the versions with each other, and I wonder whether it makes sense to combine false positives and false negatives into a single value and use that for the comparison. For example, take the ratio falsePositives/falseNegatives and see which is smaller.
The false positive rate is calculated as FP/(FP+TN), where FP is the number of false positives and TN is the number of true negatives (FP+TN being the total number of negatives). It's the probability that a false alarm will be raised: that a positive result will be given when the true value is negative.
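For instance, a minimal sketch in Python computing that rate from raw counts (the numbers below are made up purely for illustration):

    # False positive rate from raw confusion-matrix counts.
    def false_positive_rate(fp, tn):
        # FP / (FP + TN): fraction of actual negatives that triggered a false alarm.
        return fp / (fp + tn)

    # Hypothetical counts from one benchmarked version of the detector.
    print(false_positive_rate(fp=12, tn=188))  # 0.06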
If the costs of false positives and false negatives are different, then F1 is your savior. F1 is also a good choice if you have an uneven class distribution. Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
To reduce the number of False Negatives (FN) or False Positives (FP), we can also retrain a model on the same data, adjusting the training objective based on its previous results. This method involves taking the model and training it on the same dataset until its loss converges to a minimum.
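One common way to implement that bias during retraining is with class weights; the following is a minimal sketch assuming scikit-learn, and the data and weight values are illustrative assumptions rather than anything from the original answer:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    # Illustrative data standing in for the original training set.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Retrain the same kind of model, weighting the positive class more heavily so
    # the fit trades false negatives for false positives (the weights are assumptions).
    model = LogisticRegression(class_weight={0: 1.0, 1: 5.0}, max_iter=1000)
    model.fit(X, y)

    # Rows are true classes, columns are predictions: inspect FP/FN after retraining.
    print(confusion_matrix(y, model.predict(X)))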
Depending on the desired test result, either a positive or a negative outcome can be considered bad. For example, in a test for COVID you want a negative result. Although a positive result is deemed to be bad, a false negative is the worst outcome.
In addition to the popular Area Under the ROC Curve (AUC) measure mentioned by @alchemist-al, there's a score that combines both precision and recall (which are defined in terms of TP/FP/TN/FN) called the F-measure, which goes from 0 to 1 (0 being the worst, 1 the best):
F-measure = 2*precision*recall / (precision+recall)
where
precision = TP/(TP+FP) , recall = TP/(TP+FN)
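As a quick sanity check, here is a small sketch computing those quantities from raw counts (the counts are invented for illustration):

    def f_measure(tp, fp, fn):
        # precision = TP/(TP+FP), recall = TP/(TP+FN), F = their harmonic mean.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    print(f_measure(tp=80, fp=10, fn=20))  # ~0.842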
A couple of other possible solutions:
-Your false-positive rate (fp) and false-negative rate (fn) may depend on a threshold. If you plot the curve where the y-value is (1-fn) and the x-value is (fp), you'll be plotting the Receiver Operating Characteristic (ROC) curve. The Area Under the ROC Curve (AUC) is one popular measure of quality (see the sketch after this list).
-AUC can be weighted if there are certain regions of interest
-Report the Equal-Error Rate. For some threshold, fp=fn. Report this value.
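A rough sketch of the ROC/AUC/EER ideas, assuming you have a per-detection score and a ground-truth label for each sample (NumPy only; the arrays are toy values):

    import numpy as np

    # Hypothetical detector scores and ground-truth labels (1 = object present).
    scores = np.array([0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20])
    labels = np.array([1,    1,    0,    1,    0,    0,    1,    0])

    # Sweep thresholds to trace the ROC curve:
    # x = false-positive rate, y = 1 - false-negative rate (i.e. true-positive rate).
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        fpr.append(fp / (fp + tn))
        tpr.append(tp / (tp + fn))
    fpr, tpr = np.array(fpr + [1.0]), np.array(tpr + [1.0])

    # Area under the ROC curve via the trapezoid rule.
    auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)
    print("AUC:", auc)  # 0.75 on this toy data

    # Equal-error rate: the point where false-positive rate ~= false-negative rate.
    eer_index = np.argmin(np.abs(fpr - (1 - tpr)))
    print("EER:", fpr[eer_index])  # 0.25 on this toy data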
It depends on how much detail you want in the comparison.
Combining the two figures will give you an overall sense of error margin, but no insight into what sort of error it is, so if you just want to know which version is "more correct" in an overall sense, then that's fine.
If, on the other hand, you actually want to use the results for some more in-depth determination of whether the process is suited to a particular problem, then I would imagine keeping them separate is a good idea. e.g. Sometimes false negatives are a very different problem from false positives in a real-world setting. Did the robot just avoid an object that wasn't there... or fail to notice it was heading off the side of a cliff?
In short, there's no hard and fast global rule for determining how effective the vision is based on one super calculation. What you're planning to do with the information is the important bit.
You need to factor in how "important" false positives are relative to false negatives.
For example, if your program is designed to recognise people's faces, then both false positives and false negatives are equally harmless, and you can probably just combine them linearly.
But if your program was designed to detect bombs, then false positives aren't a huge deal (i.e. saying "this is a bomb" when it's actually not) but false negatives (that is, saying "this isn't a bomb" when it actually is) would be catastrophic.
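One way to make that trade-off explicit is a simple weighted combination of the two error rates, as in this sketch (the cost values and rates are illustrative assumptions, not prescribed numbers):

    def weighted_error(fp_rate, fn_rate, fp_cost=1.0, fn_cost=1.0):
        # Linear combination of the two error rates; choose the costs to reflect
        # how bad each kind of mistake is for your application.
        return fp_cost * fp_rate + fn_cost * fn_rate

    # Face recognition: both kinds of mistake weighted equally.
    print(weighted_error(fp_rate=0.05, fn_rate=0.10))                 # ~0.15

    # Bomb detection: a missed bomb (false negative) weighted far more heavily.
    print(weighted_error(fp_rate=0.05, fn_rate=0.10, fn_cost=100.0))  # ~10.05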