 

Histogram of Oriented Gradients

I have been reading theory about HOG descriptors for object(human) detection. But I have some questions about the implementation, which might sound like an insignificant detail.

Regarding the window that contains the blocks: should the window be moved over the image pixel by pixel, so that the windows overlap at each step, as illustrated here: [image: overlapping detection windows]

or should the window be moved without any overlap, as here: [image: non-overlapping detection windows]

The illustrations I have seen so far use the second approach. But considering that the detection window is 64x128, it is quite likely that sliding the window that way will not cover the whole image. If the image is 64x255, for example, the last 127 pixels will never be checked for an object. So the first approach seems more reasonable, although it is more time- and CPU-consuming.
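For concreteness, here is a small Python sketch (the image size and stride values are only illustrative) of what each strategy covers on a 64x255 image with a 64x128 window:

```python
# Illustrative sketch: vertical coverage of a 64x255 image by a 64x128
# detection window under the two sliding strategies.

def window_tops(image_h, win_h, stride):
    """Top row of every window position for a given vertical stride."""
    return list(range(0, image_h - win_h + 1, stride))

image_h, win_h = 255, 128

# Second approach: non-overlapping windows (stride == window height).
print(window_tops(image_h, win_h, win_h))  # [0] -> rows 128..254 are never examined

# First approach: dense overlap (here a stride of 8 pixels).
tops = window_tops(image_h, win_h, 8)
print(tops[-1] + win_h)                    # 248 -> only rows 248..254 are missed
```

In practice detectors usually pick a stride somewhere between these extremes, and often add one extra window flush with the image border so nothing is skipped.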

Any ideas? Thank you in advance.

EDIT: I am trying to stick to the original Dalal and Triggs paper. One paper that implements the algorithm and uses the second approach can be found here: http://www.cs.bilkent.edu.tr/~cansin/projects/cs554-vision/pedestrian-detection/pedestrian-detection-paper.pdf

Ahmet Keskin asked Apr 08 '11


People also ask

How does Histogram of Oriented Gradients work?

The HOG descriptor focuses on the structure or shape of an object. It is more informative than a plain edge descriptor because it uses both the magnitude and the angle of the gradient to compute the features. For each region of the image, it generates a histogram from the gradient magnitudes and orientations.
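As a rough illustration (a simplified sketch, not taken from the answer above; it skips the bilinear bin interpolation used in the full algorithm), the per-cell histogram can be computed like this:

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Orientation histogram of one 8x8 cell: unsigned gradients, 9 bins."""
    gx = np.zeros_like(cell, dtype=float)
    gy = np.zeros_like(cell, dtype=float)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]      # centered [-1, 0, 1] filter
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation in [0, 180)
    bins = (angle / (180 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m                              # each pixel votes with its magnitude
    return hist

print(cell_histogram(np.random.rand(8, 8)))
```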

What is Histogram of Oriented Gradients for face detection?

Histograms of Oriented Gradients are an effective descriptor for object recognition and detection. These descriptors remain robust under occlusion and under pose and illumination changes because they are extracted on a regular grid.

What is HOG and SVM?

Histogram of oriented gradients (HOG) is used for feature extraction in the human detection process, whilst linear support vector machines (SVM) are used for human classification. A set of tests is conducted to find the classifiers which optimize recall in the detection of persons in visible video sequences.
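A minimal version of that pipeline, using OpenCV's pretrained linear SVM for pedestrians (the file name is only a placeholder), might look like:

```python
import cv2

hog = cv2.HOGDescriptor()  # default Dalal-Triggs parameters: 64x128 window, 9 bins
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")                  # placeholder input image
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("street_detections.jpg", image)
```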

What is the output of HOG?

In the case of the HOG feature descriptor, the input image is of size 64 x 128 x 3 and the output feature vector is of length 3780.
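The length 3780 follows directly from the standard parameters (64x128 window, 8x8-pixel cells, 2x2-cell blocks, one-cell block stride, 9 orientation bins); a quick sanity check:

```python
win_w, win_h = 64, 128
cell, block, stride_cells, bins = 8, 2, 1, 9      # standard Dalal-Triggs parameters

blocks_x = (win_w // cell - block) // stride_cells + 1   # 7
blocks_y = (win_h // cell - block) // stride_cells + 1   # 15
print(blocks_x * blocks_y * block * block * bins)        # 7 * 15 * 4 * 9 = 3780
```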


1 Answer

EDIT: Sorry -- I misunderstood your question. (Also, the answer I provided to the wrong question was in error -- I've since adjusted that below for context.)

You're asking about using the HOG descriptor for detection, not generating the HOG descriptor.

In the implementation paper you reference above, it looks like they are overlapping the detection window. The window size is 64x128, while they use a horizontal stride of 32 pixels and a vertical stride of 64. They also mention that they tried smaller stride values, but this led to a higher false positive rate (in the context of their implementation).
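To make that concrete, here is a small sketch (the 320x240 image size is just an example) of where those windows land; with strides of half the window size, neighbouring windows overlap by 50%:

```python
img_w, img_h = 320, 240        # example image size
win_w, win_h = 64, 128         # detection window
stride_x, stride_y = 32, 64    # strides used in the referenced implementation

origins = [(x, y)
           for y in range(0, img_h - win_h + 1, stride_y)
           for x in range(0, img_w - win_w + 1, stride_x)]
print(len(origins), origins[:3])   # 18 window positions: (0, 0), (32, 0), (64, 0), ...
```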

On top of that, they're using 3 scales of the input image: 1, 1/2, and 1/4. They don't mention any corresponding scaling of the detection window -- I'm not sure what effect that would have from a detection standpoint. It seems that this would implicitly create overlap as well.
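Something like the following sketch (again, the input file is a placeholder) would produce those three scales; sliding the fixed 64x128 window over the smaller levels effectively looks for people that are 2x and 4x larger in the original image:

```python
import cv2

image = cv2.imread("street.jpg")                  # placeholder input image
pyramid = [image]
for _ in range(2):                                # add the 1/2 and 1/4 scales
    prev = pyramid[-1]
    pyramid.append(cv2.resize(prev, (prev.shape[1] // 2, prev.shape[0] // 2)))
for level, img in enumerate(pyramid):
    print(level, img.shape[:2])                   # each level is half the previous size
```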


Original answer (corrected):

Looking at the Dalal and Triggs paper (section 6.4), they mention both (i) no block overlap and (ii) half- and quarter-block overlap when generating the HOG descriptor. Based on their results, it sounds like greater overlap produced better detection performance (albeit at a greater resource/processing cost).
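For what it's worth, the block stride also changes the descriptor length; a quick sketch comparing no overlap with half-block overlap for the standard 64x128 window:

```python
def hog_length(stride_cells, cells_x=8, cells_y=16, block=2, bins=9):
    """Descriptor length for a 64x128 window with 8x8 cells and 2x2-cell blocks."""
    blocks_x = (cells_x - block) // stride_cells + 1
    blocks_y = (cells_y - block) // stride_cells + 1
    return blocks_x * blocks_y * block * block * bins

print(hog_length(2))   # no overlap:         4 *  8 * 36 = 1152
print(hog_length(1))   # half-block overlap: 7 * 15 * 36 = 3780
```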

Rick Haffey answered Oct 19 '22