I am trying to compute the derivative of the softmax activation function. I found this question: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function but nobody there seems to give a proper derivation of how we get the answers for i = j and i != j. Could someone please explain this? I am confused by the derivatives when a summation is involved, as in the denominator of the softmax activation function.
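So the derivative of the softmax function is given as

$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i (1 - p_j) & \text{if } i = j \\ -p_j \, p_i & \text{if } i \neq j \end{cases}$$

Or, using the Kronecker delta

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

both cases can be written as the single expression

$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j)$$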
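To make the two cases concrete, here is a minimal sketch in plain Python that builds the full matrix of partial derivatives. Both cases collapse to p_i * (delta_ij - p_j), which is what the inner expression computes. The function names `softmax` and `softmax_jacobian` and the example scores are my own choices for illustration, not something from the linked question.

```python
import math

def softmax(a):
    """Return the softmax probabilities for a list of scores a."""
    m = max(a)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(a):
    """J[i][j] = dp_i/da_j = p_i * (delta_ij - p_j)."""
    p = softmax(a)
    n = len(p)
    return [
        [p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical example scores
J = softmax_jacobian([1.0, 2.0, 3.0])
for row in J:
    print(row)
```

The diagonal entries are p_i * (1 - p_i), which is positive, and the off-diagonal entries are -p_i * p_j, which is negative. Each row sums to zero because the probabilities always sum to one.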
Softmax is used in training because it is differentiable, which allows us to optimize a cost function with gradient-based methods. At inference time, however, we sometimes need the model to output a single predicted class rather than a probability distribution, in which case the argmax is more useful.
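As a small illustration of the difference, the sketch below turns some made-up scores into probabilities with softmax and then picks a single class with argmax. The helper names and the scores are assumptions for the example only.

```python
import math

def softmax(scores):
    """Return the softmax probabilities for a list of scores."""
    m = max(scores)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

scores = [2.0, 0.5, -1.0]        # hypothetical raw outputs of a model
probs = softmax(scores)          # smooth, differentiable, sums to 1
predicted_class = argmax(probs)  # a single hard decision

print(probs, predicted_class)
```

Since softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same class as taking the argmax of the probabilities.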
The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels.
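In short, Softmax Loss is just a Softmax Activation followed by a Cross-Entropy Loss. Softmax is an activation function that outputs a probability for each class, and these probabilities sum to one. Cross-entropy loss is the negative sum, over the classes, of the target probability times the logarithm of the predicted probability. With a one-hot target this reduces to the negative logarithm of the probability assigned to the true class.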
When cross-entropy is used as the loss function in a multi-class classification task, 𝒚 is the one-hot encoded label and 𝑠 holds the probabilities generated by the softmax layer. This way round the logarithm is applied to 𝑠 rather than to 𝒚, so we never take the logarithm of zero: mathematically, softmax never produces an exactly zero value (although in floating-point arithmetic a very small output can still underflow to zero).
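A minimal sketch of that computation, assuming the usual definition of cross-entropy as -Sum_i(y_i * log(s_i)). The function names, the example label, and the scores are hypothetical.

```python
import math

def softmax(scores):
    """Return the softmax probabilities for a list of scores."""
    m = max(scores)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, s):
    """-Sum_i(y_i * log(s_i)); classes with y_i == 0 contribute nothing and are skipped."""
    return -sum(y_i * math.log(s_i) for y_i, s_i in zip(y, s) if y_i != 0)

y = [0.0, 1.0, 0.0]           # one-hot label: the true class is index 1
s = softmax([1.0, 2.0, 3.0])  # probabilities from the softmax layer
loss = cross_entropy(y, s)

print(loss)
```

With a one-hot label only the true class survives the sum, so the loss equals -log(s[1]) here.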
The derivative of a sum is the sum of the derivatives, i.e.:
    d(f1 + f2 + f3 + f4)/dx = df1/dx + df2/dx + df3/dx + df4/dx
To derive the derivatives of p_j with respect to o_i we start with:
    d_i(p_j) = d_i(exp(o_j) / Sum_k(exp(o_k)))
I decided to use d_i for the derivative with respect to o_i to make this easier to read.
Writing the quotient as the product exp(o_j) * (1/Sum_k(exp(o_k))) and using the product rule we get:
     d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))
Looking at the first term, the derivative will be 0 if i != j and exp(o_j) if i = j. This can be represented with a Kronecker delta, which I will call D_ij (1 if i = j, 0 otherwise). This gives (for the first term):
    = D_ij * exp(o_j) / Sum_k(exp(o_k))
Which is just our original function multiplied by D_ij
    = D_ij * p_j
For the second term, we apply the chain rule to 1/Sum_k(exp(o_k)) (this is where the minus sign and the squared denominator come from) and then differentiate each element of the sum individually. The only non-zero term is the one with k = i, which gives us:
    = -exp(o_j) * Sum_k(d_i(exp(o_k))) / Sum_k(exp(o_k))^2
    = -exp(o_j) * exp(o_i) / Sum_k(exp(o_k))^2
    = -(exp(o_j) / Sum_k(exp(o_k))) * (exp(o_i) / Sum_k(exp(o_k)))
    = -p_j * p_i
Putting the two together we get the surprisingly simple formula:
    D_ij * p_j - p_j * p_i
If you really want we can split it into i = j and i != j cases:
    i = j: D_ii * p_i - p_i * p_i = p_i - p_i * p_i = p_i * (1 - p_i)
    i != j: D_ij * p_j - p_j * p_i = 0 - p_j * p_i = -p_i * p_j
Which is our answer.
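If you want to convince yourself, the formula D_ij * p_j - p_j * p_i can be compared against a numerical derivative. This is only a sanity-check sketch: the function names, the step size h, and the test point o are my own choices.

```python
import math

def softmax(o):
    """Return the softmax probabilities for a list of scores o."""
    m = max(o)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(x - m) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

def analytic(o, i, j):
    """d_i(p_j) = D_ij * p_j - p_j * p_i, the formula derived above."""
    p = softmax(o)
    d_ij = 1.0 if i == j else 0.0
    return d_ij * p[j] - p[j] * p[i]

def numeric(o, i, j, h=1e-6):
    """Central-difference estimate of d_i(p_j): nudge o_i up and down by h."""
    up = list(o)
    up[i] += h
    down = list(o)
    down[i] -= h
    return (softmax(up)[j] - softmax(down)[j]) / (2 * h)

o = [0.3, -1.2, 2.0]  # arbitrary test point
max_gap = max(
    abs(analytic(o, i, j) - numeric(o, i, j))
    for i in range(len(o))
    for j in range(len(o))
)
print(max_gap)
```

The largest gap between the two should be tiny, limited only by the finite-difference approximation and floating-point rounding.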