In a lot of research papers I read about Convolutional Neural Networks (CNNs), I see that people randomly crop a square region (e.g. 224x224) from the images and then randomly flip it horizontally. Why is this random cropping and flipping done? Also, why do people always crop a square region? Can CNNs not work on rectangular regions?
Random crop is a data augmentation technique in which we take a random sub-region of an original image. This helps the model generalize better because the objects of interest we want it to learn are not always wholly visible in the image, nor do they always appear at the same scale in the training data.
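As a rough illustration of random cropping, here is a minimal NumPy sketch. The function name random_crop, the (H, W, C) image layout, and the 256x320 test image are assumptions made for this example, not something from the papers mentioned in the question.

```python
import numpy as np

def random_crop(image, crop_h, crop_w, rng):
    """Return a random crop_h x crop_w window from an (H, W, C) image."""
    h, w = image.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed image size")
    # Pick the top-left corner uniformly among all positions that fit.
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
# Hypothetical 256x320 RGB image filled with random pixel values.
image = rng.integers(0, 256, size=(256, 320, 3), dtype=np.uint8)
crop = random_crop(image, 224, 224, rng)
print(crop.shape)
```

Calling random_crop again on the same image will usually give a different window, which is what makes it useful as augmentation.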
The training set is the subset of the data set used to train a model. x_train is the training data, and y_train is the set of labels for all the data in x_train.
In the Downsampling network, simple CNN architectures are used and abstract representations of the input image are produced. In the Upsampling network, the abstract image representations are upsampled using various techniques to make their spatial dimensions equal to the input image.
This is referred to as data augmentation. By applying transformations to the training data, you're adding synthetic data points. This exposes the model to additional variations without the cost of collecting and annotating more data. This can have the effect of reducing overfitting and improving the model's ability to generalize.
The intuition behind flipping an image is that an object should be equally recognizable as its mirror image. Note that horizontal flipping is the type of flipping often used. Vertical flipping doesn't always make sense but this depends on the data.
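A random horizontal flip can be sketched in a few lines of NumPy. The name random_hflip, the probability parameter p, and the (H, W, C) layout are assumptions for illustration only.

```python
import numpy as np

def random_hflip(image, rng, p=0.5):
    """Mirror an (H, W, C) image left-to-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1]  # reverse the width axis
    return image

# Tiny 2x3 single-channel image so the effect is easy to see.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])[..., None]
flipped = random_hflip(image, np.random.default_rng(0), p=1.0)
print(flipped[..., 0])
```

A vertical flip would reverse the height axis instead (image[::-1]), which, as noted above, only makes sense for some kinds of data.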
The idea behind cropping is to reduce the contribution of the background to the CNN's decision. That's useful if you have labels for locating where your object is. This lets you use surrounding regions as negative examples and build a better detector. Random cropping can also act as a regularizer, so that the classification is based on the presence of parts of the object instead of relying entirely on one very distinct feature that may not always be present.
Why do people always crop a square region?
This is not a limitation of CNNs. It could be a limitation of a particular implementation, or it could be by design, because assuming a square input can make it easier to optimize the implementation for speed. I wouldn't read too much into this.
CNNs with variable sized input vs. fixed input:
This is not specific to cropping to a square but more generally why the input is sometimes resized/cropped/warped before inputting into a CNN:
Something to keep in mind is that designing a CNN involves deciding whether to support variable-sized input or not. Convolution operations, pooling and non-linearities will work for any input dimensions. However, when you use CNNs for image classification, you usually end up with one or more fully-connected layers, such as logistic regression or an MLP. The fully-connected layer is how the CNN produces a fixed-size output vector, and it expects a fixed number of input features. That requirement can restrict the CNN to a fixed-size input.
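To make the size dependence concrete, here is a small arithmetic sketch. It uses the standard output-size formula for convolution and pooling, floor((size + 2*padding - kernel) / stride) + 1, applied to a made-up toy stack (one 3x3 convolution with 16 output channels, then one 2x2 max-pool with stride 2). The stack itself is purely hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(h, w, channels=16):
    """Number of values fed to a fully-connected layer after the toy stack."""
    h, w = conv_out(h, 3), conv_out(w, 3)                      # 3x3 conv, no padding
    h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)  # 2x2 max-pool, stride 2
    return channels * h * w

print(flattened_features(224, 224))  # 197136
print(flattened_features(256, 320))  # 323088
```

The convolution and pooling layers handle both inputs, but a fully-connected layer built for 197136 input features cannot accept 323088 of them.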
There are definitely workarounds that allow variable-sized input and still produce a fixed-size output. The simplest is to use a convolution layer to perform classification over regular patches in an image. This idea has been around for a while. The intention behind it was to detect multiple occurrences of objects in the image and classify each occurrence. The earliest example I can think of is the work by Yann LeCun's group in the 1990s to simultaneously classify and localize digits in a string. This is referred to as turning a CNN with fully-connected layers into a fully convolutional network. More recent examples of fully convolutional networks are applied to semantic segmentation, where each pixel in an image is classified. Here the output is required to match the dimensions of the input. Another solution is to use global pooling at the end of a CNN to turn variable-sized feature maps into a fixed-size output. The size of the pooling window is set equal to the size of the feature map computed by the last conv layer.
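The global pooling idea can be sketched with NumPy. The (C, H, W) layout, the 8 channels, and the two feature-map sizes below are assumptions picked for the example.

```python
import numpy as np

def global_avg_pool(feature_map):
    """Average a (C, H, W) feature map over H and W, giving a length-C vector."""
    return feature_map.mean(axis=(1, 2))

rng = np.random.default_rng(0)
small = rng.random((8, 7, 7))    # feature map from a smaller input image
large = rng.random((8, 10, 13))  # feature map from a larger, rectangular input

print(global_avg_pool(small).shape, global_avg_pool(large).shape)
```

Both calls return a vector with one value per channel, so a classifier placed after the pooling step sees the same number of features regardless of the input image size.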
@ypx has already given a good answer on why data augmentation is needed. I am going to share more information about why people use square images of fixed size as input.
If you have basic knowledge of convolutional neural networks, you will know that convolutional, pooling and non-linearity layers are fine with input images of variable size. But neural networks usually have fully-connected layers as classifiers, and the shape of the weight matrix between the last conv layer and the first fully-connected layer is fixed. If you give the network a variable-sized input image, there will be a problem because the feature map size and the weight shape do not match. That is one reason fixed-size input images are used.
Another reason is that fixing the image size can reduce the training time of neural networks. This is because most (if not all) deep learning packages are written to process a batch of images in tensor format (usually in shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are the height and width of the image). If your input images do not have a fixed size, you cannot pack them into a batch. Even if your network can process variable-sized input images, you still have to feed it one image at a time. This is slower than batch processing.
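The batching constraint is easy to see with plain arrays. In this NumPy sketch (the sizes are arbitrary choices for illustration), same-sized images stack into a single (N, C, H, W) array, while mixed sizes cannot be stacked.

```python
import numpy as np

# Four images of identical size stack into one (N, C, H, W) batch.
same = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(4)]
batch = np.stack(same)
print(batch.shape)

# Images of different sizes cannot be packed into one dense array.
mixed = [np.zeros((3, 224, 224), dtype=np.float32),
         np.zeros((3, 256, 320), dtype=np.float32)]
try:
    np.stack(mixed)
    stacked = True
except ValueError:
    stacked = False
print("mixed sizes stacked:", stacked)
```

Cropping or resizing every image to a common size before stacking is what removes this obstacle.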
Yes, as long as you can produce a fixed-size input for the fully-connected layers, the input image size does not matter. A good choice is adaptive pooling, which produces fixed-size output feature maps from variable-sized input feature maps. Right now, PyTorch provides two adaptive pooling layers for images: AdaptiveMaxPool2d and AdaptiveAvgPool2d. You can use these layers to construct a neural network that accepts variable-sized input images.
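To show the idea behind adaptive average pooling without depending on PyTorch, here is a NumPy sketch. The function below is a hand-written illustration, not PyTorch's implementation; the bin boundaries (floor of the start, ceiling of the end) are an assumption chosen so that every output cell covers at least one input cell. In real code you would use AdaptiveAvgPool2d directly.

```python
import math
import numpy as np

def adaptive_avg_pool2d(x, out_h, out_w):
    """Average-pool a (C, H, W) array down to (C, out_h, out_w) for any H, W."""
    c, h, w = x.shape
    out = np.empty((c, out_h, out_w), dtype=float)
    for i in range(out_h):
        # Row range covered by output row i.
        r0, r1 = (i * h) // out_h, math.ceil((i + 1) * h / out_h)
        for j in range(out_w):
            # Column range covered by output column j.
            c0, c1 = (j * w) // out_w, math.ceil((j + 1) * w / out_w)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
a = adaptive_avg_pool2d(rng.random((8, 7, 7)), 2, 2)
b = adaptive_avg_pool2d(rng.random((8, 10, 13)), 2, 2)
print(a.shape, b.shape)
```

Both outputs have shape (8, 2, 2), so flattening them always yields 32 features for the first fully-connected layer, whatever the input size was.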