I am currently building a 2-channel (also called double-channel) convolutional neural network for measuring the similarity between 2 (binary) images.
The problem I am having is the following:
My input images are 40 x 50, and after 1 conv and 1 pooling layer (for example), the output size is 18 x 23. So how does one do more pooling without ending up with non-integer output sizes? For example, pooling an 18 x 23 image with a 2 x 2 window gives an output size of 9 x 11.5.
I cannot seem to find any suitable kernel sizes to avoid this problem, which in my opinion comes from the fact that the original input image dimensions are not powers of 2. For example, input images of size 64 x 64 don't have this issue, given the correct padding size and so on.
Any help is much appreciated.
Regarding your question:
So how does one do more pooling without ending up with non-integer output sizes?
Let's say you have:
import torch
from torch import nn
from torch.nn import functional as F
# small stand-in for your (18 x 23) activation volume (even height, odd width)
x = torch.rand(1, 1, 4, 3)
print(x)
# tensor([[[[0.5005, 0.3433, 0.5252],
# [0.4878, 0.5266, 0.0237],
# [0.8600, 0.8092, 0.8912],
# [0.1623, 0.4863, 0.3644]]]])
If you apply pooling (I will use MaxPooling in this example, and I assume you meant a 2x2 pooling with stride=2 based on your expected output shape):
p = nn.MaxPool2d(2, stride=2)
y = p(x)
print(y.shape)
# torch.Size([1, 1, 2, 1])
print(y)
# tensor([[[[0.5266],
# [0.8600]]]])
If you would like to have a [1, 1, 2, 2] output, you can set ceil_mode=True on MaxPooling:
p = nn.MaxPool2d(2, stride=2, ceil_mode=True)
y = p(x)
print(y.shape)
# torch.Size([1, 1, 2, 2])
print(y)
# tensor([[[[0.5266, 0.5252],
# [0.8600, 0.8912]]]])
You can also pad the volume to achieve the same result (here I assume the volume has min=0, as if it came after a ReLU, so zero padding cannot change any maximum):
p = nn.MaxPool2d(2, stride=2)
y = p(F.pad(x, (0, 1), "constant", 0))
print(y.shape)
# torch.Size([1, 1, 2, 2])
print(y)
# tensor([[[[0.5266, 0.5252],
# [0.8600, 0.8912]]]])
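The shapes above follow from the usual pooling output-size formula, which rounds down by default and rounds up with ceil_mode=True. Here is a minimal sketch of that arithmetic in plain Python, applied to the 18 x 23 volume from the question. The helper name pool_out_size is hypothetical (it is not a PyTorch function), and it assumes dilation 1 and the same padding on both sides:

```python
import math

def pool_out_size(n, kernel, stride, padding=0, ceil_mode=False):
    """Output length along one dimension of a pooling layer (dilation 1).

    Hypothetical helper for illustration, not part of PyTorch.
    """
    span = n + 2 * padding - kernel
    if ceil_mode:
        out = math.ceil(span / stride) + 1
        # The last window must start inside the input or the left padding,
        # otherwise it is dropped.
        if (out - 1) * stride >= n + padding:
            out -= 1
    else:
        out = span // stride + 1
    return out

# The 18 x 23 volume from the question, pooled with kernel=2, stride=2
print(pool_out_size(18, 2, 2), pool_out_size(23, 2, 2))
# 9 11
print(pool_out_size(18, 2, 2, ceil_mode=True),
      pool_out_size(23, 2, 2, ceil_mode=True))
# 9 12
# Padding the width by one column first (23 -> 24), then pooling normally
print(pool_out_size(23 + 1, 2, 2))
# 12
```

So the output size is never 9 x 11.5: the default simply drops the last column (9 x 11), while ceil_mode=True or one column of padding keeps it (9 x 12).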
Regarding:
I cannot seem to find any suitable kernel sizes to avoid such a problem, which in my opinion is a result of the fact that the original input image dimensions are not powers of 2.
Well, if you want to use pooling operations that halve the input size (e.g., MaxPooling with kernel=2 and stride=2), then using an input with a power-of-2 shape is quite convenient (after all, you'll be able to do many of these /2 operations). However, this is not required. You can change the stride of the pooling, you can always pool with ceil_mode=True, you can also pad asymmetrically, and many other things. All of them are decisions you'll have to make when building your model :)
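To see why a power-of-2 size is convenient but not required, here is a rough sketch that tracks one dimension through repeated 2x2, stride-2 pooling. The function halve_until_one is a hypothetical name, and the shortcuts n // 2 and (n + 1) // 2 only hold for kernel=2, stride=2 with no padding:

```python
def halve_until_one(n, ceil_mode=False):
    """Sizes along one dimension after repeated 2x2, stride-2 pooling.

    Hypothetical helper; valid only for kernel=2, stride=2, no padding.
    """
    sizes = [n]
    while n > 1:
        # floor mode drops a trailing odd element, ceil mode keeps it
        n = (n + 1) // 2 if ceil_mode else n // 2
        sizes.append(n)
    return sizes

print(halve_until_one(64))
# [64, 32, 16, 8, 4, 2, 1]
print(halve_until_one(50))
# [50, 25, 12, 6, 3, 1]
print(halve_until_one(50, ceil_mode=True))
# [50, 25, 13, 7, 4, 2, 1]
```

With 64 every step divides evenly. With 50 the sizes are still integers at every step; the only choice is whether the odd trailing row or column is dropped or kept.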