
metrics for feature detection/extraction methods

I wonder how we evaluate feature detection/extraction methods (SIFT, SURF, MSER, ...) for object detection and tracking of, e.g., pedestrians and lane vehicles. Are there standard metrics for comparison? I have read blog posts like http://computer-vision-talks.com/2011/07/comparison-of-the-opencvs-feature-detection-algorithms-ii/ and some research papers like this. The problem is that the more I learn, the more confused I get.

sy456 asked Jan 15 '14 15:01



1 Answer

It is very hard to evaluate feature detectors per se, because features are only computational artifacts, not things that you are actually searching for in images. Feature detectors do not make sense outside their intended context, which, for the descriptors you have mentioned, is affine-invariant image patch matching.

The very first usage of SIFT, SURF, and MSER was multi-view matching and automatic 3D reconstruction pipelines. Thus, these features are usually assessed by the quality of the 3D reconstruction or image patch matching that they provide. Roughly speaking, you have a pair of images related by a known transform (an affinity or a homography), and you measure the difference between the homography estimated from the feature matches and the real one. This is also the method used in the blog post that you cite, by the way.
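
To make this concrete, here is a minimal sketch of that evaluation, assuming OpenCV's SIFT and a pair of images with a known ground-truth homography (the file names and H_gt.txt are placeholders for your own data, e.g. a pair from a standard benchmark sequence):

    import cv2
    import numpy as np

    # Placeholder inputs: two images related by a known homography H_gt
    img1 = cv2.imread("ref.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("warped.png", cv2.IMREAD_GRAYSCALE)
    H_gt = np.loadtxt("H_gt.txt")  # 3x3 ground-truth homography

    # Detect and describe keypoints in both images
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match descriptors and keep only matches passing Lowe's ratio test
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    # Estimate the homography from the surviving matches with RANSAC
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H_est, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Compare estimated and ground-truth homographies via the
    # transfer error of the four image corners
    h, w = img1.shape
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    err = np.linalg.norm(cv2.perspectiveTransform(corners, H_est)
                         - cv2.perspectiveTransform(corners, H_gt), axis=2).mean()
    print("mean corner transfer error: %.2f px" % err)

The corner transfer error is just one possible distance between homographies; inlier counts and per-match reprojection errors are also commonly reported.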

In order to assess the practical interest of a detector (and not only its precision in an ideal multi-view pipeline), some additional measurements of stability (under geometric and photometric changes) were added: does the number of detected features vary, does the quality of the estimated homography vary, etc.
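
One standard stability measurement is repeatability: the fraction of keypoints from the first image whose ground-truth projections into the second image land close to a keypoint detected there. A simplified sketch (the eps tolerance is an arbitrary choice here; the full evaluation protocols in the literature compare region overlaps rather than plain point distances):

    import cv2
    import numpy as np

    def repeatability(kp1, kp2, H_gt, img2_shape, eps=2.5):
        """Fraction of image-1 keypoints redetected in image 2, where
        'redetected' means some image-2 keypoint lies within eps pixels
        of the ground-truth projection."""
        pts1 = np.float32([k.pt for k in kp1]).reshape(-1, 1, 2)
        proj = cv2.perspectiveTransform(pts1, H_gt).reshape(-1, 2)
        pts2 = np.float32([k.pt for k in kp2])
        h, w = img2_shape
        # Only count keypoints whose projection falls inside image 2
        inside = (proj[:, 0] >= 0) & (proj[:, 0] < w) \
               & (proj[:, 1] >= 0) & (proj[:, 1] < h)
        proj = proj[inside]
        if len(proj) == 0 or len(pts2) == 0:
            return 0.0
        # Distance from each projected point to its nearest detection
        d = np.linalg.norm(proj[:, None, :] - pts2[None, :, :], axis=2).min(axis=1)
        return float((d < eps).mean())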

Incidentally, these detectors may also work (although it was not their design purpose) for object detection and tracking (in tracking-by-detection settings). In this case, their performance is classically evaluated on more-or-less standardized image datasets, and typically expressed in terms of precision (the probability of a correct detection, linked to the false alarm rate) and recall (the probability of finding an object when it is present). You can read for example Wikipedia on this topic.
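
For detection, precision and recall are usually computed by matching detected bounding boxes to ground-truth boxes with an intersection-over-union (IoU) criterion; the 0.5 threshold below is a common convention, not a universal rule. A minimal sketch with boxes as (x1, y1, x2, y2) tuples:

    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union)

    def precision_recall(detections, ground_truth, iou_thr=0.5):
        """Greedy one-to-one matching of detections to ground truth."""
        matched, tp = set(), 0
        for det in detections:
            best, best_iou = None, iou_thr
            for i, gt in enumerate(ground_truth):
                if i not in matched and iou(det, gt) >= best_iou:
                    best, best_iou = i, iou(det, gt)
            if best is not None:
                matched.add(best)
                tp += 1
        fp = len(detections) - tp     # detections matching no object
        fn = len(ground_truth) - tp   # objects that were missed
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        return precision, recall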

Addendum: What exactly do I mean by incidentally?

Well, as written above, SIFT and the like were designed to match planar and textured image parts. This is why you always see examples with similar images from a graffiti dataset.

Their extension to detection and tracking was then developed in two different ways:

  • While doing multi-view matching (with a spherical rig), Furukawa and Ponce built a kind of locally-planar 3D object model, which they then applied to object detection in the presence of severe occlusions. This work exploits the fact that an interesting object is often locally planar and textured;
  • Other people developed a less original (but still efficient under good conditions) approach by assuming that they have a query image of the object to track. Per-frame detections are then performed by matching (using SIFT, etc.) the template image with the current frame; a minimal sketch follows this list. This exploits the facts that SIFT yields few false matches, that objects are usually observed at a distance (hence appear almost planar in the image), and that they are textured. See for example this paper.
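
A possible sketch of that per-frame matching step, assuming OpenCV and a single template image of the object (the ratio-test threshold and minimum match count are conventional values, not taken from any specific paper):

    import cv2
    import numpy as np

    def locate_object(template, frame, min_matches=10):
        """Tracking-by-detection sketch: match the template against the
        current frame with SIFT and return the template's corners projected
        into the frame, or None when too few reliable matches survive."""
        sift = cv2.SIFT_create()
        kp_t, des_t = sift.detectAndCompute(template, None)
        kp_f, des_f = sift.detectAndCompute(frame, None)
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_t, des_f, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        if len(good) < min_matches:
            return None  # object probably not visible in this frame
        src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_f[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        if H is None:
            return None
        h, w = template.shape[:2]
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
        return cv2.perspectiveTransform(corners, H)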
sansuiso answered Nov 02 '22 12:11