
Shall I apply softmax before cross entropy? [closed]

The PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) trains a convolutional neural network (CNN) on the CIFAR-10 dataset.

    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)  # one output per class

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)  # flatten the feature maps per sample
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)  # no softmax is applied here
            return x

The network looks good, except that the very last layer, fc3, which predicts the probability of belonging to each of the 10 classes, has no softmax. Shouldn't we apply a softmax first, so that the outputs of the fc layer lie between 0 and 1 and sum to 1, before calculating the cross-entropy loss?

I tested this by applying the softmax and rerunning, but the accuracy dropped to around 35%. This seems counterintuitive. What is the explanation?

Liyuan Zhang asked Mar 05 '23 09:03



1 Answer

CrossEntropyLoss in PyTorch already applies softmax internally, so it expects the raw, unnormalized outputs of the last layer. From the documentation:

https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss

This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
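
To see the equivalence quoted above, here is a minimal sketch. The random logits, the batch size of 4, and the target labels are made up for illustration; only the three PyTorch modules come from the documentation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 10 classes, raw (unnormalized) scores.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 9, 3])

# Cross-entropy applied directly to the raw scores.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same loss built by hand: log-softmax followed by negative log-likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), nll.item())
```

Both printed values should agree up to floating-point error, which is why the network in the question can return the output of fc3 unchanged.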

The answer to the second part of your question is a little more complicated, because a drop in accuracy can have several causes. In theory, the extra softmax should not stop the network from predicting the correct class. Softmax is monotonic, so applying it a second time to outputs that are already bounded between 0 and 1 may change how the values are distributed, but it preserves which entry is the largest and therefore the class that is predicted.
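
The point about the maximum being preserved can be checked with a small sketch in plain Python. The `softmax` and `argmax` helpers and the example scores are illustrative assumptions, not part of the tutorial.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical raw scores for 4 classes.
logits = [1.3, -0.2, 4.1, 0.7]
once = softmax(logits)   # what a single softmax produces
twice = softmax(once)    # what a second softmax on top produces

print(argmax(logits), argmax(once), argmax(twice))
print(max(once), max(twice))
```

The predicted class is the same in all three cases, but the second softmax pulls the largest probability noticeably closer to the others.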

However, in practice, things are a little different. With a double softmax in the output layer, you change the output function, and with it the gradients that are propagated back through your network. Softmax combined with cross-entropy is a preferred loss function because of the gradients it produces. You can verify this yourself by computing the gradients of the cost function and taking into account that each "activation" (softmax) is bounded between 0 and 1. The additional softmax "behind" the original one multiplies the gradients by values between 0 and 1, which shrinks them and in turn shrinks the updates to the weights. Changing the learning rate might compensate for this, but it is strongly discouraged. Just have one softmax (the one inside CrossEntropyLoss) and you're done.
For a more in-depth explanation, see chapter 3 of Michael Nielsen's book Neural Networks and Deep Learning.
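
As a rough illustration of the gradient argument, the sketch below compares the gradient of the cross-entropy loss with respect to the raw scores for a single softmax and for a double softmax. It is plain Python with hand-derived gradients; the helper names, the example scores, and the choice of target class are assumptions made for this example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def grad_single(logits, target):
    """d(loss)/d(logits) for cross-entropy on softmax(logits): p - y."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def grad_double(logits, target):
    """d(loss)/d(logits) for cross-entropy on softmax(softmax(logits))."""
    p = softmax(logits)
    q = softmax(p)
    # Gradient with respect to p, the input of the outer softmax.
    g = [q_i - (1.0 if i == target else 0.0) for i, q_i in enumerate(q)]
    # Chain rule through the inner softmax Jacobian: p_i * (g_i - sum_j p_j g_j).
    dot = sum(p_j * g_j for p_j, g_j in zip(p, g))
    return [p_i * (g_i - dot) for p_i, g_i in zip(p, g)]

def norm(vector):
    return math.sqrt(sum(x * x for x in vector))

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores
target = 0                # hypothetical true class

single_norm = norm(grad_single(logits, target))
double_norm = norm(grad_double(logits, target))
print(single_norm, double_norm)

# With 10 classes, the outer softmax only ever sees inputs in [0, 1], so the
# probability of the true class can never exceed e / (e + 9).
num_classes = 10
best_prob = math.e / (math.e + (num_classes - 1))
loss_floor = -math.log(best_prob)     # lowest loss the double softmax can reach
loss_uniform = math.log(num_classes)  # loss of a uniform guess
print(best_prob, loss_floor, loss_uniform)
```

For these scores the double-softmax gradient is roughly half the size of the single-softmax one. The second part shows that with 10 classes the loss is confined to a narrow band between about 1.46 and 2.30, so the network receives very little signal to learn from.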

mr_mo answered Mar 12 '23 03:03
