numpy : calculate the derivative of the softmax function

Tags:

python

neural-network

backpropagation

numpy

softmax

I am trying to understand backpropagation in a simple 3-layer neural network trained on MNIST.

There is the input layer with weights and a bias. The labels are MNIST digits, so the target is a 10-class vector.

The second layer is a linear transform. The third layer is the softmax activation, which turns the output into probabilities.

Backpropagation calculates the derivative at each step and calls this the gradient.

Each earlier layer combines the global (or previous) gradient with its own local gradient. I am having trouble calculating the local gradient of the softmax.

Several resources online explain the softmax and its derivatives, and even give code samples of the softmax itself:

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

The derivative is explained separately for the cases i = j and i != j. This is a simple code snippet I've come up with, and I was hoping to verify my understanding:

def softmax(self, x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def forward(self):
    # self.input is a vector of length 10
    # and is the output of 
    # (w * x) + b
    self.value = self.softmax(self.input)

def backward(self):
    for i in range(len(self.value)):
        for j in range(len(self.input)):
            if i == j:
                self.gradient[i] = self.value[i] * (1-self.input[i])
            else:
                self.gradient[i] = -self.value[i]*self.input[j]

Then self.gradient is the local gradient, which is a vector. Is this correct? Is there a better way to write this?

asked Nov 13 '16 16:11

Sam Hammamy

3 Answers

I am assuming you have a 3-layer NN where W1 and b1 are associated with the linear transformation from the input layer to the hidden layer, and W2 and b2 are associated with the linear transformation from the hidden layer to the output layer. Z1 and Z2 are the input vectors to the hidden layer and the output layer. a1 and a2 represent the outputs of the hidden layer and the output layer; a2 is your predicted output. delta3 and delta2 are the backpropagated errors, from which you can see the gradients of the loss function with respect to the model parameters.

[Image: forward-pass and backpropagation equations for the network described above]

This is the general scenario for a 3-layer NN (an input layer, a single hidden layer, and an output layer). You can follow the procedure described above to compute the gradients, which should be straightforward. Since another answer to this post has already pointed out the problem in your code, I won't repeat it here.
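To make the notation concrete, here is a minimal sketch of one forward and backward pass using the names W1, b1, W2, b2, Z1, Z2, a1, a2, delta3 and delta2 from this answer. Several details are assumptions that the answer does not state: a tanh hidden activation, a mean cross-entropy loss, one-hot labels, samples stored as rows, and the hypothetical helper names `softmax_rows` and `forward_backward`. It is an illustration under those assumptions, not necessarily the exact procedure from the original figure.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, shifted by each row's max for stability."""
    exps = np.exp(z - z.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def forward_backward(X, Y, W1, b1, W2, b2):
    """One forward and backward pass; Y holds one-hot labels (assumed)."""
    # Forward pass.
    Z1 = X @ W1 + b1          # input to the hidden layer
    a1 = np.tanh(Z1)          # hidden activation (tanh is an assumption)
    Z2 = a1 @ W2 + b2         # input to the output layer
    a2 = softmax_rows(Z2)     # predicted class probabilities

    # Mean cross-entropy loss (assumed loss function).
    n = X.shape[0]
    loss = -np.sum(Y * np.log(a2)) / n

    # Backward pass. With softmax followed by cross-entropy, the softmax
    # Jacobian and the loss derivative collapse to a2 - Y.
    delta3 = (a2 - Y) / n
    dW2 = a1.T @ delta3
    db2 = delta3.sum(axis=0)
    delta2 = (delta3 @ W2.T) * (1 - a1 ** 2)   # tanh'(Z1) = 1 - a1^2
    dW1 = X.T @ delta2
    db1 = delta2.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                 # 5 samples, 4 features (toy sizes)
Y = np.eye(3)[rng.integers(0, 3, size=5)]   # one-hot labels for 3 classes
W1, b1 = rng.normal(size=(4, 6)) * 0.1, np.zeros(6)
W2, b2 = rng.normal(size=(6, 3)) * 0.1, np.zeros(3)

loss, (dW1, db1, dW2, db2) = forward_backward(X, Y, W1, b1, W2, b2)
print(loss, dW1.shape, dW2.shape)
```

Note that when the loss is cross-entropy, the full softmax Jacobian never has to be built explicitly, because delta3 reduces to the difference between prediction and label.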

answered Oct 09 '22 01:10

Wasi Ahmad

As I said, you have n^2 partial derivatives.

If you do the math, you find that dSM[i]/dx[k] is SM[i] * (dx[i]/dx[k] - SM[k]), where dx[i]/dx[k] is 1 if i == k and 0 otherwise, so you should have:

if i == j:
    self.gradient[i,j] = self.value[i] * (1-self.value[i])
else: 
    self.gradient[i,j] = -self.value[i] * self.value[j]

instead of

if i == j:
    self.gradient[i] = self.value[i] * (1-self.input[i])
else:
    self.gradient[i] = -self.value[i]*self.input[j]

By the way, this may be computed more concisely like so (vectorized):

SM = self.value.reshape((-1,1))
jac = np.diagflat(self.value) - np.dot(SM, SM.T)
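As a sanity check, the vectorized Jacobian can be compared against a finite-difference estimate. The sketch below is standalone (plain functions instead of the class from the question); the helper names `softmax_jacobian` and `numerical_jacobian` and the `upstream` values are made up for illustration. It also shows one way the n x n Jacobian might be combined with the upstream gradient to get a vector for the backward pass.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D vector, shifted by max(x) for numerical stability."""
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def softmax_jacobian(s):
    """Jacobian J[i, j] = dSM[i]/dx[j], given the softmax output s."""
    return np.diagflat(s) - np.outer(s, s)

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference estimate of the Jacobian of f at x."""
    n = x.size
    jac = np.zeros((n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = eps
        jac[:, k] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([1.0, 2.0, 0.5, -1.0])
s = softmax(x)
jac = softmax_jacobian(s)

# The analytic and numerical Jacobians should agree closely.
print(np.allclose(jac, numerical_jacobian(softmax, x), atol=1e-8))

# Backward pass: combine the upstream gradient dL/dSM with the Jacobian.
upstream = np.array([0.1, -0.2, 0.3, 0.0])  # hypothetical dL/dSM
grad_input = jac.T @ upstream                # dL/dx, a vector of length n

# Equivalent form that avoids building the n x n matrix.
grad_input_fast = s * (upstream - np.dot(upstream, s))
print(np.allclose(grad_input, grad_input_fast))
```

The last form follows from expanding the sum over i of upstream[i] * SM[i] * (dx[i]/dx[k] - SM[k]), and it scales linearly rather than quadratically with the number of classes.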

answered Oct 09 '22 02:10

Julien

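np.exp overflows to inf for large inputs, so the softmax above is not numerically stable. You should subtract the maximum of x before exponentiating. This does not change the result, because softmax is unaffected by adding the same constant to every element of x.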

def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x - x.max())
    return exps / np.sum(exps)
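A small sketch of the difference, using made-up input values and the hypothetical names `softmax_naive` and `softmax_stable` to tell the two versions apart:

```python
import numpy as np

def softmax_naive(x):
    """Softmax without the max shift; overflows for large inputs."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def softmax_stable(x):
    """Softmax with the max subtracted before exponentiating."""
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

x = np.array([1000.0, 1001.0, 1002.0])

# Silence the overflow/invalid warnings so the comparison prints cleanly.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)   # exp(1000) overflows to inf, and inf / inf is nan

stable = softmax_stable(x)

print(naive)
print(stable)

# On inputs where both versions work, the shift does not change the result.
small = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax_naive(small), softmax_stable(small)))
```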

If x is a matrix, please check the softmax function in this notebook.
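For reference, a minimal sketch of a matrix version, assuming each row of x holds the scores for one sample. This is not necessarily the code from the notebook mentioned above, and the name `softmax_rows` is made up here.

```python
import numpy as np

def softmax_rows(x):
    """Softmax along the last axis; each row of x is one set of scores."""
    # keepdims=True keeps a trailing axis of size 1 so broadcasting
    # subtracts each row's own max and divides by each row's own sum.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

batch = np.array([[1.0, 2.0, 3.0],
                  [1000.0, 1000.0, 1000.0]])
probs = softmax_rows(batch)
print(probs.sum(axis=-1))  # each row sums to 1
```

Without `keepdims=True`, the max and sum would have shape `(rows,)` and would broadcast against the wrong axis (or fail) for non-square inputs.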