Given an image of an object with a known, solid background color, how can I influence a CNN to ignore/discount the features of the background, thereby emphasizing the object?
For your information, my scenario is feature extraction (e.g. neural codes) for the purposes of content-based image retrieval (CBIR). I am using Caffe.
CNNs are used for image classification and recognition because of their high accuracy. The approach was popularized by computer scientist Yann LeCun's work in the late 1980s and 1990s, which was inspired by how human visual perception recognizes things.
When an RGB image is used as input to a CNN, the depth of each filter (or kernel) is always equal to the depth of the image (3 in the case of RGB). So, if the input image is 32x32x3, the filter has to be NxNx3, where N is the height and width of the filter (for example, 3x3x3). The filter therefore consists of three two-dimensional matrices, one per color channel.
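The depth rule above can be checked with a small sketch. This is only an illustration, not how Caffe implements convolution: the function name `conv2d_single_filter` is made up for this example, it uses channel-first (3xHxW) nested lists, no padding, and a stride of 1, and it relies on plain Python so it runs without extra libraries.

```python
def conv2d_single_filter(image, kernel):
    """Slide one CxNxN filter over a CxHxW image (no padding, stride 1).

    Returns a single 2D feature map, because the products over all
    channels are summed into one number per position.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    n = len(kernel[0])
    if len(kernel) != channels:
        raise ValueError("filter depth must equal image depth")
    out_h, out_w = height - n + 1, width - n + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            total = 0.0
            for c in range(channels):
                for dy in range(n):
                    for dx in range(n):
                        total += image[c][y + dy][x + dx] * kernel[c][dy][dx]
            out[y][x] = total
    return out


# Hypothetical data: a 3x32x32 image of ones and a 3x3x3 filter of ones.
image = [[[1.0] * 32 for _ in range(32)] for _ in range(3)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]

feature_map = conv2d_single_filter(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 30 30
print(feature_map[0][0])                      # 27.0 (3 channels * 3 * 3 ones)
```

Note that one 3x3x3 filter yields one two-dimensional map; a conv layer with K filters stacks K such maps.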
A convolutional neural network has four types of layers: the convolutional layer, the pooling layer, the ReLU correction layer and the fully-connected layer.
As shown in Figure 1, the layers are built up so that the first layer detects a set of primitive patterns in the input, the second layer detects patterns of patterns, the third layer detects patterns of those patterns, and so on. Typical CNNs use 5 to 25 distinct layers of pattern recognition.
Enter Convolutional Neural Networks. In many cases, the colors in an image are unique — the exact color of a person’s clothing, the perfect shade of green for a tree, etc. are lost forever the second a black and white photo is taken. In other cases, though, colors are predictable — surprisingly so.
Meanwhile, Convolutional Neural Networks (CNN) tend to be multi-dimensional and contain some special layers, unsurprisingly called Convolutional layers. Moreover, Convolutional layers are often accompanied by Pooling layers (Max or Average), which help reduce the size of convolved features.
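To make the size reduction from pooling concrete, here is a minimal sketch of non-overlapping 2x2 max pooling. The window size, the stride of 2, and the helper name `max_pool_2x2` are assumptions chosen for the example; real networks may use other settings.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (stride 2) on a 2D feature map.

    A trailing odd row or column is dropped for simplicity.
    """
    out_h, out_w = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * y][2 * x],
                feature_map[2 * y][2 * x + 1],
                feature_map[2 * y + 1][2 * x],
                feature_map[2 * y + 1][2 * x + 1],
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical 4x4 feature map; pooling halves each spatial dimension.
fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fm)
print(pooled)  # [[4, 2], [2, 8]]
```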
Hidden layers within Convolutional Neural Networks reduce the number of parameters by "tying" together the adjacent x weights surrounding each input neuron.
A CNN can learn only from its training data whether color is a decisive factor for recognizing an object. If you only present it with red 'A's, it will learn that red is a decisive factor for recognizing the 'A'. By presenting it with a large number of different 'A's that are colored differently, you teach it that color is not decisive.
One option I can think of is creating two inputs per image: the first is your 3xHxW color image and the other is a 1xHxW mask with zeros on the background (the "solid-color" pixels) and ones elsewhere. You can then multiply the mask element-wise with the output of your first conv layer, forcing all features at "solid-color" pixel locations to be zero. For this to work, the conv output must have the same HxW size as the mask, or the mask must be resized to match.
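The masking idea in this answer can be sketched outside of any framework. The following is a rough illustration in plain Python, not Caffe code: the helper names `background_mask` and `apply_mask`, the tolerance parameter `tol`, and the sample values are all made up for the example, and it assumes the first conv layer preserves the HxW size of the input.

```python
def background_mask(image, bg_color, tol=0):
    """Build an HxW mask from a channel-first (3xHxW) image.

    A pixel is 0 if every channel is within `tol` of bg_color, else 1.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [
        [
            0 if all(abs(image[c][y][x] - bg_color[c]) <= tol
                     for c in range(channels)) else 1
            for x in range(width)
        ]
        for y in range(height)
    ]


def apply_mask(feature_maps, mask):
    """Multiply each of the K feature maps (KxHxW) element-wise by the HxW mask."""
    return [
        [[v * m for v, m in zip(f_row, m_row)]
         for f_row, m_row in zip(fmap, mask)]
        for fmap in feature_maps
    ]


# Hypothetical 3x2x3 image on a white background with two object pixels.
image = [
    [[255, 10, 255], [255, 255, 20]],  # R
    [[255, 20, 255], [255, 255, 30]],  # G
    [[255, 30, 255], [255, 255, 40]],  # B
]
mask = background_mask(image, bg_color=(255, 255, 255))
print(mask)  # [[0, 1, 0], [0, 0, 1]]

# Stand-in for the first conv layer's output: 2 feature maps of size 2x3.
features = [
    [[0.5, 0.7, 0.9], [1.1, 1.3, 1.5]],
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],
]
masked = apply_mask(features, mask)
print(masked[0])  # [[0.0, 0.7, 0.0], [0.0, 0.0, 1.5]]
```

Two caveats apply to this sketch. Object pixels that happen to match the background color are zeroed as well, and features near the object boundary still see some background inside their receptive field. In Caffe, an Eltwise layer with the PROD operation is one candidate for the multiplication step, though the mask would probably need to be replicated across channels first; treat that as an assumption to verify against the Caffe layer documentation.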