I read that CNNs (with both convolution and max-pooling layers) are shift-invariant, but most object detection methods use a sliding window detector with non-maximum suppression. Is it necessary to use sliding windows with CNNs when doing object detection?
Basically, instead of training the network on small 50x50 patches of images containing the desired object, why not train on entire images where the object is present somewhere? All I can think of are practical/performance reasons (doing a forward pass on smaller patches instead of whole images), but is there also a theoretical explanation I'm overlooking?
Internally, a CNN is already doing a sliding window. A convolution over a 2D image is nothing more than a linear filter applied in a sliding-window manner. It is simply a nice mathematical expression of the very same operation, which lets us do neat optimizations. Max pooling, on the other hand, makes the network robust to small shifts and noise. So feeding an image to the network effectively means applying (many!) sliding windows to it. Can we pass big images instead of small ones? Sure, but you will get extremely big tensors (just compute how many numbers you will need; it is huge), and you will get a really complex optimization problem. Nowadays we optimize in million-dimensional spaces. Working with whole images might lead to billions of dimensions (or even more). Optimization complexity can grow exponentially with the number of dimensions, so you will end up with an extremely slow method (not in terms of the computation itself, but in terms of convergence).
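To make the points above concrete, here is a minimal NumPy sketch (not taken from any particular framework). The helpers `conv2d_valid`, `max_pool`, and `valid_positions` are names I made up for illustration, and the 1000x1000 image size is an assumed example; only the 50x50 patch size comes from the question. It writes the convolution as an explicit sliding window, checks that shifting the input shifts the feature map, shows that max pooling absorbs a small shift but not a larger one, and counts how many window positions a single filter visits.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """CNN-style convolution (cross-correlation) as an explicit sliding window."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Same linear filter applied to every window position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size):
    """Non-overlapping max pooling; rows/cols that do not fill a window are dropped."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def valid_positions(height, width, k):
    """Number of window positions a k x k filter visits (no padding, stride 1)."""
    return (height - k + 1) * (width - k + 1)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))
image = rng.standard_normal((12, 12))

# Shifting the input by one column shifts the feature map by one column.
full = conv2d_valid(image, kernel)
shifted = conv2d_valid(image[:, 1:], kernel)
equivariant = np.allclose(full[:, 1:], shifted)

# Max pooling: a one-pixel shift inside a pooling window changes nothing,
# but a two-pixel shift here crosses into the next window.
peak = np.zeros((8, 8))
peak[2, 2] = 1.0
pool_same = np.array_equal(max_pool(peak, 4),
                           max_pool(np.roll(peak, 1, axis=1), 4))
pool_far = np.array_equal(max_pool(peak, 4),
                          max_pool(np.roll(peak, 2, axis=1), 4))

# Window positions per 3x3 filter: 50x50 patch vs. an assumed 1000x1000 image.
patch_positions = valid_positions(50, 50, 3)
image_positions = valid_positions(1000, 1000, 3)

print(equivariant, pool_same, pool_far, patch_positions, image_positions)
```

Note that the convolution itself only moves its output along with the input; whatever invariance the network has comes from pooling, and only for shifts small enough to stay inside one pooling window. The position counts also give a feel for how quickly the intermediate tensors grow once you feed in whole images.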