I wonder how we evaluate feature detection/extraction methods (SIFT, SURF, MSER, ...) for object detection and tracking tasks such as pedestrians or lane vehicles. Are there standard metrics for comparison? I have read blog posts like http://computer-vision-talks.com/2011/07/comparison-of-the-opencvs-feature-detection-algorithms-ii/ and some research papers like this. The problem is that the more I learn, the more confused I get.
3.1 Feature detection evaluation. The selected algorithms are SIFT, SURF, FAST, BRISK, and ORB. The selected detectors are applied to three images to locate keypoints. Each image contains a single object.
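For concreteness, here is a minimal OpenCV (Python) sketch of this step: run several detectors on one image and compare keypoint counts. The image path is hypothetical, and SURF is omitted because it requires a non-free opencv-contrib build.

```python
import cv2

# Hypothetical path to one of the single-object test images
img = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

# SURF (cv2.xfeatures2d.SURF_create) needs an opencv-contrib build
# with non-free algorithms enabled, so it is left out here.
detectors = {
    "SIFT":  cv2.SIFT_create(),
    "FAST":  cv2.FastFeatureDetector_create(),
    "BRISK": cv2.BRISK_create(),
    "ORB":   cv2.ORB_create(),
}

for name, det in detectors.items():
    keypoints = det.detect(img, None)
    print(f"{name}: {len(keypoints)} keypoints")
```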
Feature extraction. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalise fairly well (e.g. a model trained on the MS COCO dataset can extract fairly generic features).
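As an illustration (not from the original answer), a common way to do this with PyTorch/torchvision is to take an ImageNet-pretrained classifier such as ResNet-50 and drop its classification head; the file name below is hypothetical.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained classifier reused as a generic feature extractor
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classification head
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

image = Image.open("frame.jpg").convert("RGB")  # hypothetical path
with torch.no_grad():
    features = backbone(preprocess(image).unsqueeze(0))  # shape (1, 2048)
```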
Feature extraction refers to the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set. It typically yields better results than applying machine learning directly to the raw data.
It is very hard to evaluate feature detectors per se, because features are only computational artifacts, not things you are actually searching for in images. Feature detectors do not make sense outside their intended context, which, for the descriptors you mention, is affine-invariant image patch matching.
The very first use of SIFT, SURF, and MSER was in multi-view reconstruction and automatic 3D reconstruction pipelines. Thus, these features are usually assessed by the quality of the 3D reconstruction or image patch matching they provide. Roughly speaking, you have a pair of images related by a known transform (an affinity or a homography) and you measure the difference between the estimated homography (from the feature detector) and the real one. This is also the method used in the blog post you quote, by the way.
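Here is a minimal sketch of that evaluation, assuming SIFT in OpenCV (Python) and a ground-truth homography file in the style of the Oxford graffiti sequences; the file names are hypothetical.

```python
import cv2
import numpy as np

img1 = cv2.imread("graf1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("graf3.png", cv2.IMREAD_GRAYSCALE)
H_gt = np.loadtxt("H1to3p")  # 3x3 ground-truth homography (plain text)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test on 2-nearest-neighbour matches
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H_est, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)

# Transfer error: project the corners of img1 with both homographies
h, w = img1.shape
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
err = np.linalg.norm(cv2.perspectiveTransform(corners, H_gt)
                     - cv2.perspectiveTransform(corners, H_est), axis=2)
print("mean corner transfer error (px):", err.mean())
```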
In order to assess the practical interest of a detector (and not only its precision in an ideal multi-view pipeline), some additional measurements of stability (under geometric and photometric changes) were added: does the number of detected features vary, does the quality of the estimated homography vary, etc.
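One such stability measure is repeatability. Below is a simplified, point-based sketch (the classic criterion of Mikolajczyk et al. uses region overlap instead, and also restricts counting to the shared field of view; the eps threshold here is illustrative).

```python
import cv2
import numpy as np

def repeatability(det, img1, img2, H_gt, eps=2.5):
    """Fraction of keypoints from img1 that, once projected into img2
    with the ground-truth homography, land within eps pixels of a
    keypoint detected in img2. Boundary handling is omitted."""
    kp1 = det.detect(img1, None)
    kp2 = det.detect(img2, None)
    if not kp1 or not kp2:
        return 0.0
    pts1 = np.float32([k.pt for k in kp1]).reshape(-1, 1, 2)
    proj = cv2.perspectiveTransform(pts1, H_gt).reshape(-1, 2)
    pts2 = np.float32([k.pt for k in kp2])
    # pairwise distances between projected and detected keypoints
    d = np.linalg.norm(proj[:, None, :] - pts2[None, :, :], axis=2)
    repeated = int((d.min(axis=1) < eps).sum())
    return repeated / min(len(kp1), len(kp2))

# e.g. repeatability(cv2.ORB_create(), img1, img2, H_gt)
```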
Incidentally, it happens that these detectors may also work (although that was not their design purpose) for object detection and tracking (in tracking-by-detection settings). In this case, their performance is classically evaluated on more-or-less standardized image datasets, and typically expressed in terms of precision (the probability of a correct answer, linked to the false alarm rate) and recall (the probability of finding an object when it is present). You can read, for example, Wikipedia on this topic.
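As a sketch of how precision and recall are typically computed for detection (a generic IoU-based matching scheme, not something prescribed by the answer above), with a detection counted as a true positive when its overlap with an unmatched ground-truth box exceeds a threshold:

```python
import numpy as np

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(detections, ground_truth, thr=0.5):
    tp, matched = 0, set()
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(det, gt) >= thr:
                tp += 1
                matched.add(i)
                break
    fp = len(detections) - tp   # detections matching no object
    fn = len(ground_truth) - tp  # objects that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall
```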
Addendum: What exactly do I mean by incidentally?
Well, as written above, SIFT and the like were designed to match planar, textured image parts. This is why you always see examples with similar images from a graffiti dataset.
Their extension to detection and tracking was then developed in two different ways: