
Backpropagation activation derivative

I've implemented backpropagation as explained in this video: https://class.coursera.org/ml-005/lecture/51

This seems to have worked successfully, passing gradient checking and allowing me to train on MNIST digits.

However, I've noticed that most other explanations of backpropagation calculate the output delta as

d = (a - y) * f'(z)

(see http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm), whilst the video uses

d = a - y

When I multiply my delta by the activation derivative (the sigmoid derivative), I no longer get the same gradients as gradient checking (they differ by at least an order of magnitude).

What allows Andrew Ng (in the video) to leave out the derivative of the activation for the output delta, and why does it work? And why does adding the derivative produce incorrect gradients?

EDIT

I have now tested with both linear and sigmoid activation functions on the output; in both cases gradient checking only passes when I use Ng's delta equation (no sigmoid derivative).
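To make the discrepancy concrete, here is a minimal, self-contained gradient-check sketch (not my actual implementation) for a single sigmoid output unit, assuming the cross-entropy cost used in the course; the toy data and variable names are arbitrary. The analytic gradient from d = a - y matches the numerical gradient, while the version with the extra sigmoid derivative does not.

```python
import numpy as np

# Gradient check for one logistic output unit with cross-entropy cost.
# w, b, x, y and the toy data are illustrative assumptions only.
rng = np.random.default_rng(0)
x = rng.normal(size=5)          # one input example
y = 1.0                         # its target label
w = rng.normal(size=5) * 0.1    # weights
b = 0.0                         # bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b):
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

a = sigmoid(w @ x + b)

# Analytic gradients for the two candidate output deltas
grad_ng    = (a - y) * x                 # delta = a - y
grad_extra = (a - y) * a * (1 - a) * x   # delta = (a - y) * sigmoid'(z)

# Numerical gradient by central differences
eps = 1e-6
grad_num = np.array([
    (cost(w + eps * e, b) - cost(w - eps * e, b)) / (2 * eps)
    for e in np.eye(len(w))
])

print(np.max(np.abs(grad_num - grad_ng)))     # tiny: matches
print(np.max(np.abs(grad_num - grad_extra)))  # does not match: off by a factor of a*(1-a)
```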

Asked by Kieren Anderson, Oct 06 '15.


2 Answers

Found my answer here. The output delta does require multiplication by the derivative of the activation, as in:

d = (a - y) * g'(z)

However, Ng is using the cross-entropy cost function, whose derivative cancels the g'(z) term, resulting in the d = a - y calculation shown in the video. If a mean squared error cost function is used instead, the derivative of the activation function must be included.
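To spell out the cancellation for anyone else who hits this, here is the standard derivation (my own notation: sigmoid output a = g(z) with the cross-entropy cost):

```latex
% Cross-entropy cost for a sigmoid output a = g(z):
% J = -\left[\, y \ln a + (1 - y) \ln(1 - a) \,\right]
\frac{\partial J}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}
                              = \frac{a - y}{a(1 - a)},
\qquad g'(z) = a(1 - a)

d = \frac{\partial J}{\partial z}
  = \frac{\partial J}{\partial a}\, g'(z)
  = \frac{a - y}{a(1 - a)} \cdot a(1 - a)
  = a - y
```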

Answered by Kieren Anderson, Sep 23 '22.


When using neural networks, how you design your network depends on the learning task. A common approach for regression tasks is to use the tanh() activation function for the input and all hidden layers, while the output layer uses a linear activation function (img taken from here).
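As a rough sketch of that setup (layer sizes and initialisation are arbitrary choices for illustration, not a recommendation):

```python
import numpy as np

# Forward pass for the regression architecture described above:
# tanh on the hidden layer, linear (identity) activation on the output.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1

W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(W1 @ x + b1)   # non-linear hidden layer
    return W2 @ h + b2         # linear output layer (no squashing)

print(forward(rng.normal(size=n_in)))
```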


I did not find the source, but there was a theorem stating that combining non-linear and linear activation functions lets you better approximate the target function. Examples of using different activation functions can be found here and here.

There are many different kinds of activation function that can be used (img taken from here). If you look at the derivatives, you can see that the derivative of the linear function equals 1, so it no longer appears in the delta. This is also the case in Ng's explanation: if you look at minute 12 of the video, you can see that he is talking about the output layer.
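For illustration, here are a few common activations and their derivatives (standard definitions, not taken from the linked image); note that the linear one has derivative 1, which is why the g'(z) factor simply drops out at a linear output layer:

```python
import numpy as np

# Common activation functions paired with their derivatives.
activations = {
    "linear":  (lambda z: z,               lambda z: np.ones_like(z)),
    "sigmoid": (lambda z: 1 / (1 + np.exp(-z)),
                lambda z: (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z)))),
    "tanh":    (lambda z: np.tanh(z),      lambda z: 1 - np.tanh(z) ** 2),
}

z = np.linspace(-2, 2, 5)
for name, (g, dg) in activations.items():
    print(name, g(z).round(3), dg(z).round(3))
```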


Concerning the backpropagation algorithm:

"When neuron is located in the output layer of the network, it is supplied with a desired response of its own. We may use e(n) = d(n) - y(n) to compute the error signal e(n) associated with this neuron; see Fig. 4.3. Having determined e(n), we find it a straightforward matter to compute the local gradient [...] When neuron is located in a hidden layer of the network, there is no specified desired response for that neuron. Accordingly, the error signal for a hidden neuron would have to be determined recursively and working backwards in terms of the error signals of all the neurons to which that hidden neuron is directly connected"

Haykin, Simon S., et al. Neural Networks and Learning Machines. Vol. 3. Upper Saddle River: Pearson Education, 2009. pp. 159-164.
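A minimal sketch of the recursion that quote describes, for a one-hidden-layer network with a squared-error cost (shapes, activations and notation are my own assumptions, not Haykin's):

```python
import numpy as np

# Output layer: error signal comes directly from the desired response.
# Hidden layer: error signal is computed backwards from the output delta.
rng = np.random.default_rng(0)
x, d = rng.normal(size=3), rng.normal(size=1)        # input and desired response

W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)  # hidden layer (tanh)
W2, b2 = rng.normal(size=(1, 4)) * 0.1, np.zeros(1)  # linear output layer

# Forward pass
h = np.tanh(W1 @ x + b1)
y = W2 @ h + b2

# Output delta: linear output with squared-error cost
delta_out = y - d

# Hidden delta: determined recursively from the downstream delta
delta_hidden = (W2.T @ delta_out) * (1 - h ** 2)   # tanh'(z) = 1 - h^2

# Weight gradients
grad_W2 = np.outer(delta_out, h)
grad_W1 = np.outer(delta_hidden, x)
print(grad_W1.shape, grad_W2.shape)
```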

Answered by Westranger, Sep 23 '22.