In the FCN paper, the authors discuss patch-wise training and fully convolutional training. What is the difference between these two? Please refer to Section 4.4 of the paper.
It seems to me that the training mechanism is as follows: assume the original image is M*M, then iterate over the M*M pixels to extract N*N patches (where N < M). The iteration stride can be some number like N/3, so as to generate overlapping patches (see the sketch below). Moreover, assume each single image corresponds to 20 patches; then we can put these 20 patches, or 60 patches (if we want to use 3 images), into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.
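A minimal NumPy sketch of the patch extraction I have in mind (the helper name and the 12*12 / 6*6 sizes are just my own illustration):

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    # Slide an N*N window over an M*M image with the given stride,
    # collecting the (overlapping) patches into one array.
    patches = []
    M = image.shape[0]
    for top in range(0, M - patch_size + 1, stride):
        for left in range(0, M - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)

# Toy numbers: a 12*12 image, 6*6 patches, stride 2 (= N/3).
image = np.random.rand(12, 12)
batch = extract_patches(image, patch_size=6, stride=2)
print(batch.shape)  # (16, 6, 6): 4 window positions per axis
```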
Patchwise training explicitly crops out the sub-images and produces an output for each sub-image in an independent forward pass, so the computation over overlapping regions is repeated for every patch that contains them. Fully convolutional training instead pushes the whole image through the network once and shares that computation across all the overlapping receptive fields, which is why it is usually substantially faster than patchwise training.
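You can see the shared computation with a toy fully convolutional net (a sketch, not the paper's architecture): with valid convolutions, each spatial position of the full-image output equals an independent forward pass over the corresponding receptive-field patch, but the full-image pass computes the overlapping regions only once.

```python
import numpy as np
import tensorflow as tf

# A tiny fully convolutional net: two valid 3x3 convolutions, so each
# output position has a 5x5 receptive field in the input.
net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(None, None, 1)),
    tf.keras.layers.Conv2D(2, 3),  # 2 "classes", no dense layer anywhere
])

image = np.random.rand(1, 12, 12, 1).astype("float32")

# One fully convolutional pass over the whole image...
full = net(image).numpy()                    # shape (1, 8, 8, 2)

# ...matches an independent patchwise pass over any 5x5 patch; this one
# is the receptive field of output position (2, 3).
patch = net(image[:, 2:7, 3:8, :]).numpy()   # shape (1, 1, 1, 2)
print(np.allclose(full[0, 2, 3], patch[0, 0, 0], atol=1e-5))  # True
```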
The first thing that struck me was fully convolutional networks (FCNs). An FCN is a network that does not contain any "Dense" layers (as in traditional CNNs); instead, it contains 1x1 convolutions that perform the task of fully connected (Dense) layers.
FCN_model: we need to specify the number of classes required in the final output layer. The model object is passed to the train() function, which compiles it with the Adam optimizer and a categorical cross-entropy loss function; a checkpoint callback saves the best model during training.
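Since the original code is not shown here, this is only a minimal Keras sketch of what such an FCN_model / train() pair could look like; the layer sizes, data arguments, and checkpoint file name are my own assumptions:

```python
import tensorflow as tf

def build_fcn_model(num_classes):
    # Minimal FCN: a small conv backbone plus a 1x1 convolution standing in
    # for the usual Dense head; num_classes sets the final output depth.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(None, None, 3)),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        # 1x1 convolution = a per-pixel fully connected layer over channels
        tf.keras.layers.Conv2D(num_classes, 1, activation="softmax"),
    ])

def train(model, train_data, val_data, epochs=20):
    # Compile with the Adam optimizer and categorical cross-entropy, then
    # save the best model seen during training via a checkpoint callback.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", save_best_only=True)
    return model.fit(train_data, validation_data=val_data,
                     epochs=epochs, callbacks=[checkpoint])
```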
The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training. In semantic segmentation, given that you are classifying each pixel in the image, by using the whole image, you are adding a lot of redundancy in the input. A standard approach to avoid this during training segmentation networks is to feed the network with batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps to converge faster and to balance the classes. In this paper, they claim that is it not necessary to use patch-wise training and if you want to balance the classes you can weight or sample the loss. In a different perspective, the problem with full image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling". So by "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer so the loss is not calculated over the whole image.