As far as I know, CNN rely on sliding window techniques and can only indicate if a certain pattern is present or not anywhere in given bounding boxes. Is that true?
Can one achieve localization with CNN without any help of such techniques?
The Convolutional Neural Network (CNN or ConvNet) is a subtype of the Neural Networks that is mainly used for applications in image and speech recognition. Its built-in convolutional layer reduces the high dimensionality of images without losing its information.
The convolutional layer is the first layer of a convolutional network. While convolutional layers can be followed by additional convolutional layers or pooling layers, the fully-connected layer is the final layer. With each layer, the CNN increases in its complexity, identifying greater portions of the image.
To introduce non-linearity into the system and improve the learning capacity, the output from the convolution operation is passed through a non-saturating activation function like sigmoid or rectified linear unit (ReLU). Check out this excellent article about these and several other commonly used activation functions.
The reading of the matrix then begins, for which the software selects a smaller image, known as the ‘filter’ (or kernel). The depth of the filter is the same as the depth of the input. The filter then produces a convolution movement along with the input image, moving right along the image by 1 unit.
There are some recent techniques to localize the objects in CNN's. See this paper http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf
It uses a layer called Global Average Pooling (GAP), and with no additional work, the CNN can localize the object it recognizes.
Also checkout this really good blog post: https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/
Thats an open problem in image recognition. Besides sliding windows, existing approaches include predicting object location in image as CNN output, predicting borders (classifiyng pixels as belonging to image boundary or not) and so on. See for example this paper and references therein.
Also note that with CNN using max-pooling, one can identify positions of feature detectors that contributed to object recognition, and use that to suggest possible object location region.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With