I am reading multiple conflicting Stack Overflow posts and I'm really confused about what the reality is.
My question is the following: if I trained an FCN on 128x128x3 images, is it possible to feed it A) an image of size 256x256x3, or B) 128x128, or C) neither, since the inputs have to be the same during training and testing?
Consider SO post #1. This post suggests that the input images have to be the same dimensions during training and testing. This makes sense to me.
SO post #2: This post suggests that we can forward a different-sized image at test time, and that with some squeeze operations this becomes possible. I am not at all sure how that works.
SO post #3: This post suggests that only the depth (channel dimension) needs to be the same, not the height and width. How is this possible?
Bottom line as I understand it: if I trained on 128x128x3, then from the input layer to the first conv layer, (1) there is a fixed number of strides that take place, consequently (2) a fixed feature-map size, and accordingly (3) a fixed number of weights. If I suddenly change the input image size to 512x512x3, there is no way the feature maps from training and testing are even comparable, due to the difference in size, UNLESS, when I feed 512x512, only the top 128x128 is considered and the rest of the image is ignored.
Can someone clarify this? As you can see, there are multiple posts on this with no canonical answer, so a community-aided answer that everyone agrees on would be very helpful.
Here's my breakdown:
Post 1: Yes, this is the standard way to do things. If you have variable-sized inputs, you crop/pad/resize them so that all your inputs are the same size.
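For concreteness, here is a minimal sketch of that preprocessing step, assuming TensorFlow (since Post #2 mentions tf.squeeze); the helper name to_training_size and the target resolution are just placeholders for illustration:

```python
import tensorflow as tf

# A minimal sketch (assumed TensorFlow): force every image back to the
# 128x128x3 training resolution before feeding it to the trained network.
def to_training_size(image, target_h=128, target_w=128):
    """Resize an arbitrary HxWx3 image to the fixed training size."""
    # Bilinear resize; tf.image.resize_with_crop_or_pad would crop/pad instead.
    return tf.image.resize(image, [target_h, target_w])

# e.g. a 256x256x3 test image becomes 128x128x3
test_image = tf.random.uniform([256, 256, 3])
print(to_training_size(test_image).shape)  # (128, 128, 3)
```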
Post 2: Note that this person is talking about a "fully convolutional network", not a "fully connected network". In a fully convolutional network, all the layers are convolution layers, and convolution layers have no issue consuming arbitrarily sized (width and height) inputs as long as the channel dimension is fixed.
The need for a fixed input size arises in standard convolutional networks because of the "flattening" done before feeding the convolution output to fully connected layers. So if you get rid of the fully connected layers (i.e. a fully convolutional network), you don't have that problem.
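To make that concrete, here is a hedged sketch (assuming Keras; the layer sizes are arbitrary) of how a Flatten + Dense head ties the model to the 128x128 training resolution:

```python
import tensorflow as tf

# Sketch (assumed Keras, arbitrary layer sizes): once Flatten + Dense is
# attached, the Dense weights are sized for the 128x128 training resolution,
# so any other spatial size no longer fits.
fixed_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.Flatten(),          # ties the model to 128*128*16 features
    tf.keras.layers.Dense(10),
])

fixed_net(tf.random.uniform([1, 128, 128, 3]))    # works
# fixed_net(tf.random.uniform([1, 256, 256, 3])) # fails with a shape-mismatch error
```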
Post 3: It is saying basically the same thing as Post 2 (in my eyes). To summarise: if your convolutional network has a fully connected layer and you try to feed it variable-sized inputs, you'll get a RuntimeError. But if your output is convolutional, a 7x7x512 (h x w x channels) input gives a 1x1x<output_channel> output, whereas an 8x8x512 input gives a 2x2x<output_channel> output (because of how the convolution operation works).
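Here is a small sketch of that 7x7-vs-8x8 example, assuming Keras and an arbitrary num_classes; the only thing that changes with the input size is the spatial size of the output:

```python
import tensorflow as tf

# Sketch (assumed Keras): a single 7x7 "valid" convolution acting as the head
# of a fully convolutional network. num_classes is an arbitrary choice.
num_classes = 21
head = tf.keras.layers.Conv2D(num_classes, kernel_size=7, padding='valid')

out_7 = head(tf.random.uniform([1, 7, 7, 512]))   # -> (1, 1, 1, 21)
out_8 = head(tf.random.uniform([1, 8, 8, 512]))   # -> (1, 2, 2, 21)
print(out_7.shape, out_8.shape)
```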
The bottom line is that if your network has fully connected layers somewhere, you cannot directly feed it variable-sized inputs (without padding/cropping/resizing), but if your network is fully convolutional, you can.
One thing I don't know and can't comment on is: when the probability map is [None, n, n, num_classes] sized (as in Post #2), how to bring that to [None, 1, 1, num_classes], as you need to do that to perform tf.squeeze.
Edit 1:
I am adding this section to clarify how the input/output/kernel of a convolution operation behaves when the input size changes. In short, a change in the input will change the size of the output (that is, its height and width dimensions), but the kernel (which is of shape [height x width x in_channels x out_channels]) will not be affected by this change.
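A quick sketch of this point, again assuming Keras, with an arbitrary 3x3, 64-filter convolution:

```python
import tensorflow as tf

# Sketch (assumed Keras): the kernel shape depends only on the layer's
# hyper-parameters, not on the input's height/width.
conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding='same')

small = conv(tf.random.uniform([1, 128, 128, 3]))  # output: (1, 128, 128, 64)
large = conv(tf.random.uniform([1, 512, 512, 3]))  # output: (1, 512, 512, 64)

print(small.shape, large.shape)
print(conv.kernel.shape)  # (3, 3, 3, 64) in both cases -- unchanged by input size
```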
Hope this makes sense.