In the PyTorch tutorial, the constructed network is
Net(
(conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=120, bias=True)
(fc2): Linear(in_features=120, out_features=84, bias=True)
(fc3): Linear(in_features=84, out_features=10, bias=True)
)
It is used to process images with dimensions 1x32x32. They mention that the network cannot be used with images of a different size.
The two convolutional layers seem to allow for an arbitrary number of features, so the linear layers seem to be responsible for reducing the 32x32 input down to the 10 final features.
I do not really understand how the numbers 120 and 84 are chosen there, and why the result matches the input dimensions.
And when I try to construct a similar network, I actually run into a problem with the dimensions of the data.
When I, for example, use a simpler network:
Net(
(conv1): Conv2d(3, 8, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(8, 16, kernel_size=(5, 5), stride=(1, 1))
(fc1): Linear(in_features=400, out_features=3, bias=True)
)
for an input of size 3x1200x800, I get the error message:
RuntimeError: size mismatch, m1: [1 x 936144], m2: [400 x 3] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:940
Where does the number 936144 come from, and how do I need to design the network so that the dimensions match?
PyTorch - nn.Linear
nn.Linear(n, m) is a module that creates a single-layer feed-forward network with n inputs and m outputs. Mathematically, it performs a matrix multiplication plus a bias: y = x A^T + b, where x is the input, A is the weight matrix and b is the bias.
Linear layers use matrix multiplication to transform their input features into output features. The input features received by a linear layer are passed as a flattened one-dimensional tensor (per sample) and are then multiplied by the weight matrix.
The input size (in_features) is the number of features per example in your data.
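A minimal sketch of that behaviour (the layer size and batch size here are just illustrative):
import torch
import torch.nn as nn

lin = nn.Linear(400, 3)   # 400 input features -> 3 output features
x = torch.randn(1, 400)   # a batch with one flattened sample
y = lin(x)                # equivalent to x @ lin.weight.T + lin.bias
print(y.shape)            # torch.Size([1, 3])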
The key step is between the last convolution and the first Linear block. Conv2d outputs a tensor of shape [batch_size, n_features_conv, height, width], whereas Linear expects [batch_size, n_features_lin]. To make the two align you need to "stack" the three dimensions [n_features_conv, height, width] into one [n_features_lin]. It follows that n_features_lin == n_features_conv * height * width.
. In the original code this "stacking" is achieved by
x = x.view(-1, self.num_flat_features(x))
and if you inspect num_flat_features it just computes this n_features_conv * height * width product. In other words, your first Linear layer must have num_flat_features(x) input features, where x is the tensor coming out of the preceding convolution. But we need to calculate this value ahead of time, so that we can initialize the network in the first place...
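For reference, a sketch of what num_flat_features computes (modelled on the tutorial's helper):
def num_flat_features(self, x):
    size = x.size()[1:]      # all dimensions except the batch dimension
    num_features = 1
    for s in size:
        num_features *= s    # channels * height * width
    return num_features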
The calculation follows from inspecting the operations one by one.
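Here is that bookkeeping as a standalone sketch for the tutorial's network, assuming (as in the tutorial's forward) a 2x2 max pooling after each convolution:
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)        # batch of one 1x32x32 image
x = F.relu(nn.Conv2d(1, 6, 5)(x))    # 5x5 kernel, no padding -> 1 x 6 x 28 x 28
x = F.max_pool2d(x, 2)               # 2x2 pooling -> 1 x 6 x 14 x 14
x = F.relu(nn.Conv2d(6, 16, 5)(x))   # 5x5 kernel, no padding -> 1 x 16 x 10 x 10
x = F.max_pool2d(x, 2)               # 2x2 pooling -> 1 x 16 x 5 x 5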
That final 5x5 is why in the tutorial you see self.fc1 = nn.Linear(16 * 5 * 5, 120): it's n_features_conv * height * width when starting from a 32x32 image. If you want to use a different input size, you have to redo the above calculation and adjust your first Linear layer accordingly.
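Applying the same bookkeeping to the network from the question explains where 936144 comes from, assuming (as that number suggests) that the forward pass also applies a 2x2 max pooling after each convolution:
# input: 3 x 1200 x 800
# conv1 (5x5, no padding):  8 x 1196 x 796
# max_pool2d (2x2):         8 x  598 x 398
# conv2 (5x5, no padding): 16 x  594 x 394
# max_pool2d (2x2):        16 x  297 x 197
# flattened: 16 * 297 * 197 = 936144
self.fc1 = nn.Linear(16 * 297 * 197, 3)  # in __init__, instead of nn.Linear(400, 3)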
For the further operations it's just a chain of matrix multiplications (that's what Linear does). So the only rule is that the n_features_out of the previous Linear matches the n_features_in of the next one. The values 120 and 84 are entirely arbitrary, though they were probably chosen by the author such that the resulting network performs well.
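Concretely, that is the tutorial's chain of linear layers:
self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 400 inputs -> 120 outputs
self.fc2 = nn.Linear(120, 84)          # must accept fc1's 120 outputs
self.fc3 = nn.Linear(84, 10)           # must accept fc2's 84 outputs; 10 output classes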