I am trying to compute the derivative of the softmax activation function. I found this: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function but nobody there seems to give a proper derivation of the answers for the i = j and i != j cases. Could someone please explain this! I get confused taking derivatives when a summation is involved, as in the denominator of the softmax activation function.
So the derivative of the softmax function is given as

$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i (1 - p_j) & \text{if } i = j \\ -p_j \, p_i & \text{if } i \neq j \end{cases}$$

or, using the Kronecker delta

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

as $\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j)$.
We must use softmax in training because softmax is differentiable, which lets us optimize a cost function with gradient-based methods. For inference, however, we sometimes just need the model to output a single predicted class rather than a probability distribution, in which case argmax is more useful.
The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems, where an example must be assigned to one of more than two class labels.
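As a minimal illustration of both points (this sketch is mine, not from the original post; the names `softmax` and `logits` are assumptions):

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; it cancels out mathematically.
    e = np.exp(a - np.max(a))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw outputs of the final layer

p = softmax(logits)
print(p)             # [0.659 0.242 0.099] -- probabilities for each class
print(p.sum())       # 1.0 -- they sum to one
print(np.argmax(p))  # 0 -- the single predicted class at inference
```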
In short, Softmax Loss is actually just a Softmax Activation plus a Cross-Entropy Loss. Softmax is an activation function that outputs a probability for each class, and these probabilities sum to one. Cross-entropy loss is the sum of the negative logarithms of the predicted probabilities, weighted by the true labels.
When cross-entropy is used as the loss function in a multi-class classification task, 𝒚 is fed the one-hot encoded label and the probabilities generated by the softmax layer are put in 𝑠. Done this way, we never take the logarithm of zero, since mathematically softmax never produces exactly zero values.
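A short sketch of that combination (again my own illustration; `y` is the one-hot label and `s` the softmax output, following the symbols above):

```python
import numpy as np

def cross_entropy(y, s):
    # y: one-hot encoded label, s: softmax probabilities.
    # Only the term for the true class survives the sum.
    return -np.sum(y * np.log(s))

s = np.array([0.659, 0.242, 0.099])  # softmax output from the sketch above
y = np.array([1.0, 0.0, 0.0])        # one-hot label for class 0

print(cross_entropy(y, s))  # -log(0.659) ≈ 0.417
```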
The derivative of a sum is the sum of the derivatives, i.e.:
d(f1 + f2 + f3 + f4)/dx = df1/dx + df2/dx + df3/dx + df4/dx
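A quick sanity check of this sum rule with SymPy (my own illustration; assumes SymPy is installed):

```python
import sympy as sp

x = sp.symbols('x')
fs = [sp.sin(x), x**2, sp.exp(x), sp.log(x + 1)]

# d(f1 + f2 + f3 + f4)/dx ...
lhs = sp.diff(sum(fs), x)
# ... equals df1/dx + df2/dx + df3/dx + df4/dx
rhs = sum(sp.diff(f, x) for f in fs)

assert sp.simplify(lhs - rhs) == 0
```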
To derive the derivatives of p_j with respect to o_i we start with:

d_i(p_j) = d_i(exp(o_j) / Sum_k(exp(o_k)))

I decided to use d_i for the derivative with respect to o_i to make this easier to read.
Using the product rule we get:
d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))
Looking at the first term, the derivative will be 0 if i != j; this can be represented with a delta function, which I will call D_ij. This gives (for the first term):

= D_ij * exp(o_j) / Sum_k(exp(o_k))

which is just our original function multiplied by D_ij:

= D_ij * p_j
For the second term, when we differentiate each element of the sum individually, the only non-zero term will be the one where i = k. This gives us (not forgetting the power rule, because the sum is in the denominator):

= -exp(o_j) * Sum_k(d_i(exp(o_k))) / Sum_k(exp(o_k))^2

= -exp(o_j) * exp(o_i) / Sum_k(exp(o_k))^2

= -(exp(o_j) / Sum_k(exp(o_k))) * (exp(o_i) / Sum_k(exp(o_k)))

= -p_j * p_i
Putting the two together we get the surprisingly simple formula:
D_ij * p_j - p_j * p_i
If you really want, we can split it into the i = j and i != j cases:

i = j: D_ii * p_i - p_i * p_i = p_i - p_i * p_i = p_i * (1 - p_i)

i != j: D_ij * p_j - p_j * p_i = 0 - p_j * p_i = -p_i * p_j
Which is our answer.
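To check the result numerically, here is a small NumPy sketch (my own, not from the answer) comparing the closed-form Jacobian D_ij * p_j - p_j * p_i against finite differences:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def softmax_jacobian(o):
    # J[j, i] = d p_j / d o_i = D_ij * p_j - p_j * p_i
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

o = np.array([0.5, -1.2, 3.0, 0.1])
J = softmax_jacobian(o)

# Compare against central finite differences.
eps = 1e-6
J_fd = np.empty((len(o), len(o)))
for i in range(len(o)):
    d = np.zeros_like(o)
    d[i] = eps
    J_fd[:, i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)

print(np.max(np.abs(J - J_fd)))  # ~1e-10 -- the two agree
```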