Scenario:
In closed-set face recognition, if we have 10 people in a Gallery set, then the query images will be from among these 10 people. Each query will thus be assigned to one of the 10 people.
In open-set face recognition, query faces may come from people outside the 10 persons in the Gallery. These extra people are known as "distractors". An example task can be found in the IJB-A challenge.
Question:
Suppose I have an SVM (one-versus-all) trained for each of the 10 identities. How am I to report accuracy in the open-set scenario? If a query image X comes in, my model will ALWAYS identify it as one of the 10 people in my Gallery, albeit with a low score if the query's true identity is not among the 10. So when reporting accuracy as a percentage, every distractor query is counted as wrong, dragging down the overall accuracy of labeling each query image with its correct identity.
Is this the correct way to report recognition accuracy under an open-set protocol? Or is there a standard way to set a threshold on the classification score, so that we can say "query image X has a low score for every identity in the Gallery, thus we know it is a distractor image and we will not consider it when computing our recognition accuracy"?
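To make the question concrete, here is a minimal sketch of the thresholded decision I have in mind, assuming scikit-learn-style one-vs-all SVMs that expose a decision_function; the classifier list and the threshold value are hypothetical placeholders:

    import numpy as np

    def predict_open_set(svms, x, reject_threshold):
        # One decision score per Gallery identity for the query feature vector x.
        scores = np.array([svm.decision_function(x.reshape(1, -1))[0] for svm in svms])
        best = int(np.argmax(scores))
        if scores[best] < reject_threshold:
            return -1  # flagged as a distractor, not counted in identification accuracy
        return best    # index of the predicted Gallery identity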
Lastly, a caveat: this is very specific to biometrics, and face recognition in particular. However, SO provides the most coherent answers, and biometrics people are likely to be active in the Vision and Image Processing tags, which is why I am asking this here.
I come from the open-set world (license plate recognition), so it appears natural to me to define something like a minimum confidence threshold for a positive recognition.
I would recommend looking at the histograms of recognition qualities/scores/confidences (whatever they're called in your domain) for people from your set and for distractors, i.e. one histogram for SVM_A with several images of Person A, one histogram for SVM_A with several images of other persons from your set, and one histogram for SVM_A with several images of distractors.
The expected result (if your SVMs are well-behaved) is that you get basically only very high scores for SVM_i with person i, and only very low scores for SVM_i with both other persons from your set and distractors. In particular, the results for 'other persons from set' and 'distractors' should be basically identical in a statistical sense: they should all be recognized as 'NOT person i', i.e. with very low scores.
I would expect (hope) that a natural cutoff position will present itself somewhere between the highest false-positive score (SVM_A on non-A) and the lowest true-positive score (SVM_A on A).
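A minimal sketch of that histogram inspection for a single classifier, assuming scikit-learn-style SVMs with a decision_function; svm_a and the three feature arrays are placeholder names:

    import numpy as np
    import matplotlib.pyplot as plt

    # Scores of one classifier (SVM_A) on three groups of images.
    scores_genuine    = svm_a.decision_function(feats_person_a)       # Person A
    scores_inset      = svm_a.decision_function(feats_other_gallery)  # other set persons
    scores_distractor = svm_a.decision_function(feats_distractors)    # distractors

    plt.hist(scores_genuine, bins=30, alpha=0.5, label='Person A (genuine)')
    plt.hist(scores_inset, bins=30, alpha=0.5, label='Other persons from set')
    plt.hist(scores_distractor, bins=30, alpha=0.5, label='Distractors')

    # Candidate cutoff: midway between highest false positive and lowest true positive.
    cutoff = (scores_genuine.min() + max(scores_inset.max(), scores_distractor.max())) / 2
    plt.axvline(cutoff, linestyle='--', label='Candidate cutoff')
    plt.xlabel('SVM_A decision score')
    plt.legend()
    plt.show()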
You could also introduce 'distractor' as an additional category besides your closed-set persons and look at the recognition matrix (first row: A recognized as A, as B, ..., as N, as distractor; second row the same for B; ...; last row: distractor recognized as A, as B, ..., as N, as distractor), then calculate your correct-classification percentage from that matrix.
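A sketch of that recognition matrix, again with scikit-learn-style SVMs; here -1 is the assumed ground-truth label for distractor queries, and the reject threshold is the cutoff discussed above:

    import numpy as np

    def recognition_matrix(svms, X, y_true, reject_threshold):
        # (N+1) x (N+1) matrix; index N is the extra 'distractor' category.
        n = len(svms)
        mat = np.zeros((n + 1, n + 1), dtype=int)
        for x, y in zip(X, y_true):
            scores = np.array([svm.decision_function(x.reshape(1, -1))[0] for svm in svms])
            # Reject as distractor if even the best score is sub-threshold.
            pred = int(np.argmax(scores)) if scores.max() >= reject_threshold else n
            mat[n if y == -1 else y, pred] += 1
        return mat

The correct-classification percentage is then the trace of the matrix divided by its sum.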
Edit: I now understand you're concerned with your average recognition confidence (right?). Since you have no way of explicitly training for non-set persons, I think it is fair to ignore the cases where distractors were correctly identified as distractors (the highest confidence of all SVMs is sub-threshold).
Self-answered, after reading through some NIST protocols as reference [Sec 4]:
In the open-set scenario, two performance metrics are to be reported: CMC and DET.
The Cumulative Match Characteristic or CMC curve is computed only using matched images -- i.e. using those images in the Probe or Test set that are from subjects present in the Gallery set. The CMC reports the recall at each rank 1, 2, ..., #classes.
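A minimal sketch of the CMC computation, assuming a precomputed score matrix with one row per matched probe image and one column per Gallery identity (all names are placeholders):

    import numpy as np

    def cmc(scores, y_true):
        # Rank (0-based) at which the true identity appears for each probe.
        order = np.argsort(-scores, axis=1)  # identities sorted by descending score
        ranks = np.array([np.where(order[i] == y_true[i])[0][0]
                          for i in range(len(y_true))])
        # CMC[k] = fraction of probes whose true identity is within the top k+1.
        return np.array([(ranks <= k).mean() for k in range(scores.shape[1])])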
The Decision Error Tradeoff or DET curve is used to quantify how good the system is at rejecting "impostors" or "distractors". There is one SVM per Gallery identity, so for each query image there would be #identity scores (for 50 identities in the Gallery we would have 50 SVMs giving 50 scores). Taking the max of the SVM scores shows how close the input image is to being part of the Gallery set of identities. The DET curve is then plotted using these max scores, in a way very similar to ROC curves for verification. The axes are False Positive Identification Rate (FPIR) versus False Negative Identification Rate (FNIR).
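A simplified sketch of the DET points using only the max-over-SVMs score, as described above; it ignores whether the top match is also the correct identity, and the two input arrays (max scores for in-Gallery probes and for distractor probes) are assumed to be precomputed:

    import numpy as np

    def det_points(max_scores_mated, max_scores_nonmated, thresholds):
        points = []
        for t in thresholds:
            fnir = (max_scores_mated < t).mean()      # in-Gallery probe wrongly rejected
            fpir = (max_scores_nonmated >= t).mean()  # distractor wrongly accepted
            points.append((fpir, fnir))
        return points

    # e.g. sweep: thresholds = np.linspace(scores_min, scores_max, 100)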