I am trying to understand how backpropagation works mathematically and want to implement it in Python with numpy. For my calculations I use a feedforward neural network with one hidden layer, sigmoid as the activation function, and mean squared error as the error function. This is a screenshot of the result of my calculations, and the problem is that I end up with a bunch of matrices that I cannot multiply out completely, because they don't have matching dimensions.
(In the screenshot, L is the output layer, L-1 is the hidden layer, L-2 is the input layer, W is a weight, E is the error function, and lowercase a denotes the activations.)
(In the code, the first layer has 28*28 nodes [because I am using the MNIST database of handwritten 0-9 digits as training data], the hidden layer has 15 nodes, and the output layer has 10 nodes.)
# ho stands for hidden_output
# ih stands for input_hidden
def train(self, input_, target):
    self.input_ = input_
    self.output = self.feedforward(self.input_)
    # Derivative of the error with respect to the weights between the output layer and the hidden layer
    delta_ho = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.hidden
    # Derivative of the error with respect to the weights between the input layer and the hidden layer
    delta_ih = (self.output - target) * sigmoid(np.dot(self.weights_ho, self.hidden), True) * self.weights_ho * sigmoid(np.dot(self.weights_ih, self.input_), True) * self.input_
    # Adjust weights
    self.weights_ho -= delta_ho
    self.weights_ih -= delta_ih
At the delta_ho = ... line, the dimensions of the matrices are (10x1 - 10x1) * (10x1) * (1x15), so how do I compute this? Thanks for any help!
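To make the shape problem concrete, here is a tiny standalone sketch with dummy arrays of those sizes (the names here are just placeholders, not my actual attributes):

import numpy as np

err = np.ones((10, 1))         # self.output - target
sig_prime = np.ones((10, 1))   # sigmoid(np.dot(self.weights_ho, self.hidden), True)
hidden_row = np.ones((1, 15))  # self.hidden as a row vector

# The elementwise product of the first two terms is fine, but how should it
# be combined with the (1, 15) term so that the result has the same shape
# as self.weights_ho, i.e. (10, 15)?
print((err * sig_prime).shape)  # (10, 1)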
Here is a note from Stanford's CS231n: http://cs231n.github.io/optimization-2/.
For back-propagation with matrices/vectors, one thing to remember is that the gradient with respect to (w.r.t.) a variable (matrix or vector) always has the same shape as the variable.
For example, if the loss is l and its calculation contains a matrix multiplication C = A.dot(B), suppose A has shape (m, n) and B has shape (n, p) (hence C has shape (m, p)). The gradient w.r.t. C is dC, which also has shape (m, p). To obtain a matrix with the same shape as A using dC and B, the only option is dC.dot(B.T), the product of two matrices of shape (m, p) and (p, n); this gives dA, the gradient of the loss w.r.t. A. Similarly, the gradient of the loss w.r.t. B is dB = A.T.dot(dC).
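As a quick sanity check of that shape rule, here is a small numpy sketch (the sizes are arbitrary, not the ones from the question):

import numpy as np

m, n, p = 4, 3, 2
A = np.random.rand(m, n)
B = np.random.rand(n, p)
C = A.dot(B)               # shape (m, p)

dC = np.ones_like(C)       # pretend upstream gradient dl/dC, shape (m, p)
dA = dC.dot(B.T)           # shape (m, n), same as A
dB = A.T.dot(dC)           # shape (n, p), same as B

print(dA.shape, dB.shape)  # (4, 3) (3, 2)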
For any additional operation, such as the sigmoid, you can chain it backwards in the same way as anywhere else: multiply elementwise by its local derivative.
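Putting this together for your network, a sketch of the backward pass could look like the code below. This is only an illustration, not your exact class: I am assuming column vectors throughout, weights_ih of shape (15, 784) and weights_ho of shape (10, 15), a sigmoid(z, deriv) helper like yours that returns the derivative when the flag is set, and a learning rate lr that your current code does not have.

import numpy as np

def sigmoid(z, deriv=False):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s) if deriv else s

def backward(weights_ih, weights_ho, input_, target, lr=0.1):
    # Forward pass, keeping the pre-activations
    z_hidden = np.dot(weights_ih, input_)   # (15, 1)
    hidden = sigmoid(z_hidden)              # (15, 1)
    z_out = np.dot(weights_ho, hidden)      # (10, 1)
    output = sigmoid(z_out)                 # (10, 1)

    # Output layer: the elementwise terms stay (10, 1); the outer product
    # with hidden.T gives a gradient shaped like weights_ho, i.e. (10, 15)
    delta_out = (output - target) * sigmoid(z_out, True)   # (10, 1)
    grad_ho = np.dot(delta_out, hidden.T)                   # (10, 15)

    # Hidden layer: propagate back through weights_ho.T (the dC.dot(B.T)
    # rule above), then the outer product with input_.T matches weights_ih
    delta_hidden = np.dot(weights_ho.T, delta_out) * sigmoid(z_hidden, True)  # (15, 1)
    grad_ih = np.dot(delta_hidden, input_.T)                # (15, 784)

    weights_ho -= lr * grad_ho
    weights_ih -= lr * grad_ih
    return weights_ih, weights_ho

In other words, delta_ho in your code should be an outer product, np.dot(delta_out, self.hidden.T), rather than an elementwise multiplication with self.hidden, so that it ends up with the same shape as self.weights_ho.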