I have a problem classifying the MNIST dataset with a fully connected deep neural net with 2 hidden layers in PyTorch.
I want to use tanh as the activation in both hidden layers, but at the end I should use softmax.
For the loss, I am choosing nn.CrossEntropyLoss() in PyTorch, which (as I have found out) does not accept one-hot encoded labels as true labels, but takes a LongTensor of class indices instead.
My model is nn.Sequential(), and when I use softmax at the end, it gives me worse accuracy on the test data. Why?
import torch
from torch import nn

inputs, n_hidden0, n_hidden1, out = 784, 128, 64, 10
n_epochs = 500

model = nn.Sequential(
    nn.Linear(inputs, n_hidden0, bias=True),
    nn.Tanh(),
    nn.Linear(n_hidden0, n_hidden1, bias=True),
    nn.Tanh(),
    nn.Linear(n_hidden1, out, bias=True),
    nn.Softmax()  # SHOULD THIS BE THERE?
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.5)

for epoch in range(n_epochs):
    y_pred = model(X_train)
    loss = criterion(y_pred, Y_train)
    print('epoch: ', epoch + 1, ' loss: ', loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Here we have to be careful: nn.CrossEntropyLoss already applies LogSoftmax and then the negative log-likelihood (nn.LogSoftmax + nn.NLLLoss), so we must not add a softmax layer ourselves.
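As a minimal sketch (using the same layer sizes as in the question), the model should end with the final Linear layer and feed raw logits to the loss:

import torch
from torch import nn

inputs, n_hidden0, n_hidden1, out = 784, 128, 64, 10

# Same architecture as in the question, but without the final nn.Softmax:
# nn.CrossEntropyLoss applies log-softmax internally.
model = nn.Sequential(
    nn.Linear(inputs, n_hidden0, bias=True),
    nn.Tanh(),
    nn.Linear(n_hidden0, n_hidden1, bias=True),
    nn.Tanh(),
    nn.Linear(n_hidden1, out, bias=True),  # raw logits go straight into the loss
)
criterion = nn.CrossEntropyLoss()  # expects logits and a LongTensor of class indices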
Categorical cross-entropy loss is closely related to the softmax function, since in practice it is only used with networks that have a softmax layer at the output.
When you have a double softmax in the output layer, you effectively change the output function in such a way that the gradients propagated back through your network also change. Softmax with cross-entropy is the preferred combination precisely because of the clean gradients it produces.
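For reference, with p = softmax(z) over the logits z and a one-hot target y, the gradient of the combined softmax + cross-entropy loss with respect to the logits has the simple form

∂L/∂z_i = p_i − y_i

Stacking a second softmax in front of the loss destroys this simple form and shrinks the gradients.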
In short, softmax loss is actually just a softmax activation plus a cross-entropy loss. Softmax is an activation function that outputs a probability for each class, and these probabilities sum to one. Cross-entropy loss is then just the negative logarithm of the probability assigned to the true class, summed over the batch.
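A small sketch to illustrate the equivalence (the toy logits and targets here are made up for illustration):

import torch
import torch.nn.functional as F

# Toy logits for a batch of 2 samples and 3 classes, plus class-index targets.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.2, 0.3]])
targets = torch.tensor([0, 1])  # LongTensor of class indices, not one-hot

# cross_entropy is log_softmax followed by nll_loss
loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_a, loss_b))  # True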
As stated in the torch.nn.CrossEntropyLoss() doc:

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.

Therefore, you should not use softmax before it.
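If you do want probabilities (for reporting or thresholding), apply softmax outside the training loss; for accuracy alone, argmax over the raw logits is enough. A sketch, assuming hypothetical X_test / Y_test tensors:

with torch.no_grad():
    logits = model(X_test)                 # model without the final nn.Softmax
    preds = logits.argmax(dim=1)           # same argmax as after a softmax
    accuracy = (preds == Y_test).float().mean().item()
    probs = torch.softmax(logits, dim=1)   # only if probabilities are actually needed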