I am working on HOG descriptors and I am mostly done, except for the fusion of the detection windows.
What I have done so far: I build a scale-space pyramid of the image, and at each scale I slide the 64x128 detection window and detect humans. As a result, a single person is typically detected by more than one window.
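In code, my loop looks roughly like this (simplified sketch; `window_score` is a stand-in for the trained HOG+SVM classifier, and the stride and scale step are illustrative, not the values I actually use):

```python
import cv2
import numpy as np

def detect_multiscale(image, window_score, scale_step=1.2, stride=8):
    """Slide a 64x128 window over each level of a scale-space pyramid.

    window_score: stand-in for the HOG+SVM classifier; returns an SVM
                  decision value for one 64x128 patch.
    Returns an (N, 4) array of (x, y, scale, score), where x, y is the
    window's top-left corner in the RESIZED level's coordinates.
    """
    detections = []
    scale = 1.0
    level = image
    while level.shape[0] >= 128 and level.shape[1] >= 64:
        for y in range(0, level.shape[0] - 128 + 1, stride):
            for x in range(0, level.shape[1] - 64 + 1, stride):
                s = window_score(level[y:y + 128, x:x + 64])
                if s > 0:  # keep windows above the decision threshold
                    detections.append((x, y, scale, s))
        scale *= scale_step
        level = cv2.resize(image, (int(image.shape[1] / scale),
                                   int(image.shape[0] / scale)))
    return np.array(detections)
```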
So the question is: how do I fuse all these windows (assume they belong to one person) into a single window? Dalal suggests using a robust mode-detection algorithm such as mean shift. But I have multiple scales... Should I first estimate the true (original-image) location of the detection windows found at the lower levels of the scale space before doing that?
Any help is appreciated. Thanks in advance.
My interpretation is that mean shift would, in effect, give you what you are suggesting.
Essentially, you first estimate the probability distribution of the person's location at the coarsest scale, using the strengths of the detector outputs as weights. This gives you a robust estimate of the mode.
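As a sketch of that step (not Dalal's exact implementation): map every window into a common (x, y, log-scale) space by rescaling its centre back to original-image coordinates, then run score-weighted mean shift there. The bandwidths below are illustrative guesses, not tuned values, and the (x, y, scale, score) layout matches the detection sketch in the question.

```python
import numpy as np

def to_mode_space(detections):
    """Map (x, y, scale, score) windows to (cx, cy, log-scale) points
    in original-image coordinates, plus their scores as weights."""
    x, y, s, w = detections.T
    pts = np.column_stack([(x + 32) * s,   # 64x128 window centre, rescaled
                           (y + 64) * s,
                           np.log(s)])
    return pts, w

def mean_shift_mode(detections, bandwidth=(16.0, 16.0, 0.4),
                    seed=None, iters=30, tol=1e-3):
    """Score-weighted mean shift with a Gaussian kernel; returns one mode."""
    pts, w = to_mode_space(detections)
    h = np.asarray(bandwidth)
    # Start from the strongest detection unless a seed is given.
    m = pts[np.argmax(w)].copy() if seed is None else np.asarray(seed, float)
    for _ in range(iters):
        d = (pts - m) / h                         # normalised offsets
        k = w * np.exp(-0.5 * (d * d).sum(axis=1))  # kernel * score
        m_new = (k[:, None] * pts).sum(axis=0) / k.sum()
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m  # fused detection: centre (x, y) and log-scale
```

The returned mode converts straight back to a window: centre `(m[0], m[1])` with size `64 * exp(m[2])` by `128 * exp(m[2])`.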
You can then iteratively refine the estimate using the finer scales, searching around the current maximum or mode.
The idea is very similar to that used in pyramidal Lucas-Kanade (LK) tracking, for example. You could also consider ensemble processing and/or particle filters.
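To make the coarse-to-fine idea concrete, here is one hypothetical way to wire it up with the `mean_shift_mode` and `to_mode_space` sketches above: estimate the mode from the coarsest level's detections alone, then re-run mean shift on each finer level, seeded at the previous estimate and restricted to nearby detections.

```python
import numpy as np

def refine_across_scales(detections, bandwidth=(16.0, 16.0, 0.4), radius=2.0):
    """Coarse-to-fine mode refinement (uses the helpers sketched above).

    radius is measured in bandwidth units in (x, y, log-scale) space.
    """
    pts, _ = to_mode_space(detections)
    h = np.asarray(bandwidth)
    levels = np.sort(np.unique(detections[:, 2]))[::-1]  # coarsest first

    # Robust initial estimate from the coarsest level alone.
    mode = mean_shift_mode(detections[detections[:, 2] == levels[0]], bandwidth)

    for s in levels[1:]:
        # Keep only detections down to this level that lie near the
        # current mode, then re-run mean shift seeded at that mode.
        near = np.linalg.norm((pts - mode) / h, axis=1) < radius
        sel = detections[(detections[:, 2] >= s) & near]
        if len(sel):
            mode = mean_shift_mode(sel, bandwidth, seed=mode)
    return mode
```

This sketch assumes one person; for multiple people you would seed mean shift from each strong detection and merge seeds that converge to the same mode.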