
What do node gradients represent in a neural network?

I am following along (my code is a mess, I'm just messing around) with Introduction to the Math of Neural Networks, using this simple 3-layer neural net: [diagram of the 3-layer network from the book]

My calculations are coming out pretty much the same as the book (attributing difference to rounding):

o1 delta: 0.04518482993361776
h1 delta: -0.0023181625149143255
h2 delta: 0.005031782661407674
h1 -> o1: 0.01674174257328656
h2 -> o1: 0.033471787838638474
b2 -> o1: 0.04518482993361776
// didn't calculate layer 1 gradients but would use the same approach
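Roughly, my code gets each weight gradient by multiplying the downstream node's delta by the activation feeding into that weight. A stripped-down sketch of that (the hidden activations below are placeholders I made up, not the book's actual values):

```python
# Stripped-down sketch of how I compute the layer-2 gradients.
# out_h1 and out_h2 are the hidden activations for this training pass;
# the values here are placeholders, not the book's actual numbers.
delta_o1 = 0.04518482993361776

out_h1 = 0.37   # placeholder activation of h1
out_h2 = 0.74   # placeholder activation of h2
out_b2 = 1.0    # the bias node always outputs 1

# gradient of the error w.r.t. each weight feeding into o1
grad_h1_o1 = delta_o1 * out_h1
grad_h2_o1 = delta_o1 * out_h2
grad_b2_o1 = delta_o1 * out_b2   # equals delta_o1, matching my b2 -> o1 value
```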

But what exactly are the gradients? Are they the individual node's contribution to the error of o1?

asked Jul 15 '14 by user1873073

People also ask

What does gradient represent in neural network?

The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, allowing us to predict the effect of taking a small step from a point in any direction.

What do nodes represent in neural networks?

A node, also called a neuron or Perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection. Nodes are then organized into layers to comprise a network.

What are gradients in deep learning?

The gradient is a vector that gives us the direction in which the loss function has its steepest ascent. The direction of steepest descent is exactly opposite to the gradient, and that is why we subtract the gradient vector from the weights vector.

Why are gradients important in machine learning?

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is simply used in machine learning to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.


3 Answers

Let me first explain gradient descent. Gradient descent is an optimization algorithm for minimizing a cost function.

Consider the following example: [plot of a function f(t), with the tangent line at the initial point t1]

Here f(t) is the function we want to minimize. t has some initial value t1, but we want to find the value of t at which f(t) reaches its minimum.

This is the formula for the gradient descent algorithm:

t = t - α · d/dt f(t),

where α is the learning rate and d/dt f(t) is the derivative of the function. A derivative is simply the slope of the line that is tangent to the function at a point.

We keep applying this formula until we reach the minimum.

Looking at the picture above, gradient descent will update t in the following fashion: the slope (derivative) is positive and α is positive, so the value of t will decrease, hence reducing f(t). We repeat this until d/dt f(t) == 0 (the slope of any differentiable function at its minima and maxima is zero).
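As a minimal sketch of that loop, using a made-up function f(t) = (t - 3)^2 and a made-up learning rate, just for illustration:

```python
# Minimal 1-D gradient descent sketch: repeatedly apply t = t - alpha * f'(t).
def f(t):
    return (t - 3.0) ** 2        # example function to minimize (assumed)

def df_dt(t):
    return 2.0 * (t - 3.0)       # its derivative, i.e. the slope at t

alpha = 0.1                      # learning rate
t = 8.0                          # some initial value t1

for _ in range(100):
    t = t - alpha * df_dt(t)     # step against the slope

print(t)  # converges toward 3.0, where f(t) is minimal and f'(t) == 0
```

If alpha were too large, the same loop would overshoot the minimum instead of settling into it, which is the issue mentioned in the note at the end of this answer.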

We can now apply the idea of gradient descent in our backpropagation algorithm in order to properly adjust our weights.

Given a training example e, we define the error function as

E_e(w) = 1/2 ∑_(k ∈ Outputs) (d_k − o_k)^2,

where k ranges over the outputs of the neural network, d_k is the desired output, and o_k is the observed output.

Observation: if this function E equals 0, that means that for all k, d_k == o_k, i.e. the output of the neural network was the same as the desired one and there is no work to be done; our NN is very smart. Initially, of course, the weights are assigned randomly and this is never the case, but it is what we want to achieve (or nearly achieve, hopefully).
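As a tiny, made-up illustration of that error function for a network with two outputs:

```python
# Squared error for one training example e with two outputs (made-up values).
desired  = [1.0, 0.0]            # d_k
observed = [0.75, 0.2]           # o_k produced by the network

E_e = 0.5 * sum((d - o) ** 2 for d, o in zip(desired, observed))
print(E_e)  # 0.05125; E_e == 0 only if every o_k matches its d_k
```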

Since we now have an error function that we want to minimize, does it click now what we can apply? Gradient descent, yes! ^^ The idea is to modify the weights according to the negative of the gradient of the error function, to get a fast reduction of the error on this example e, so we revise the weights according to the gradient information as

∆w_ji = α (−∂E/∂w_ji)

If you compare with the example above, this does exactly the same thing; the only difference is that here the error function is a multivariate function, i.e. for a given weight we take the partial derivative with respect to that particular weight.
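A minimal sketch of one such weight update, assuming a sigmoid output unit; all numeric values are assumptions for illustration:

```python
# One gradient-descent step on a single weight w_ji feeding output unit j,
# assuming a sigmoid output unit; values are illustrative only.
alpha = 0.5          # learning rate
x_i   = 0.37         # activation coming from node i
w_ji  = 0.2          # current weight i -> j
d_j   = 1.0          # desired output
o_j   = 0.6          # observed output of unit j

# For E = 1/2 * (d_j - o_j)^2 and a sigmoid unit,
# dE/dw_ji = -(d_j - o_j) * o_j * (1 - o_j) * x_i
dE_dw = -(d_j - o_j) * o_j * (1.0 - o_j) * x_i

w_ji = w_ji + alpha * (-dE_dw)   # step in the negative gradient direction
print(w_ji)                      # weight moved so the error on e shrinks
```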

Applying this error correction (gradient descent) to the weights over and over again, we reach a point where the error function is minimized and our NN is well trained.

*Note: there are many issues involved in this problem. For example, if the learning rate is too large, gradient descent can even overshoot the minimum, but to avoid confusion, don't worry about this too much for now. :)

answered Oct 28 '22 by Ranic

I have not read the book, but it sounds like you need to read the chapter on the gradient descent algorithm. I have found this course (https://class.coursera.org/ml-006/lecture) to be a very good introduction; it starts from a very intuitive presentation of linear regression.

The direct answer to your question is: gradients are the partial derivatives of the error with respect to the node weights. Gradient descent tries to find a solution that minimizes some error function (usually mean squared error). The way you find this combination is to compute the derivative of the function and update the weights against the direction of the derivative, scaled by a small multiplier also known as the learning rate. For a nested function like a neural network, the hidden layer's derivative can be obtained via the chain rule.
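For instance, here is a rough sketch of how the chain rule yields a hidden node's delta from the output delta, assuming sigmoid activations; the numbers are placeholders, not from your book:

```python
# Chain rule for a hidden node h (sigmoid activations assumed, values made up):
# delta_h = f'(net_h) * sum over outputs k of (w_h_k * delta_k)
out_h    = 0.37                   # hidden node's activation
w_h_o1   = 0.3                    # weight from h to output o1
delta_o1 = 0.045                  # output node's delta

sigmoid_derivative = out_h * (1.0 - out_h)        # f'(net_h) for a sigmoid
delta_h = sigmoid_derivative * (w_h_o1 * delta_o1)

# the gradient for a weight x -> h is then delta_h times x's activation
```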

I would suggest trying to completely understand the simplest case, linear regression with one variable, where you can still plot your error function and see what it looks like. After that, understanding the neural network case will follow naturally. This is covered in the Coursera ML course, along with programming exercises.
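Here is a short sketch of that simplest case, a single-weight model y = w * x fit by gradient descent on mean squared error; the toy data and learning rate are assumptions:

```python
# One-variable linear regression (y = w * x) fit by gradient descent on MSE.
xs = [1.0, 2.0, 3.0, 4.0]         # toy data (assumed)
ys = [2.0, 4.0, 6.0, 8.0]         # true relationship is y = 2x

w, alpha = 0.0, 0.01
for _ in range(1000):
    # dMSE/dw = (2/n) * sum((w*x - y) * x)
    grad = 2.0 / len(xs) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= alpha * grad             # move w against the gradient

print(w)  # approaches 2.0, the slope that minimizes the error
```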

answered Oct 28 '22 by Zaw Lin

Consider the cost function of a neural network, J(theta), where theta = (theta_1, theta_2, ..., theta_n) are the weights of the connections in the network. Our goal is to minimize J(theta) with respect to theta. Notice that J(theta) is a multivariate continuous function; mathematically, the gradient with respect to theta_i is simply the partial derivative of J with respect to theta_i. Now let's try to find out the physical meaning of gradients.

For demonstration, consider a theta that consists of only one variable, x; that is, theta = (x). Then the gradient of J with respect to x is simply J'(x), the derivative of J at the point x. Now, for sufficiently small alpha:

[J(x) is increasing at x]
==> [J'(x) >= 0]
==> [x - alpha * J'(x) <= x]
==> [J(x - alpha * J'(x)) <= J(x)]

Similarly,

[J(x) is decreasing at x]
==> [J'(x) <= 0]
==> [x - alpha * J'(x) >= x]
==> [J(x - alpha * J'(x)) <= J(x)]

So for small alpha we always get a lower value of J by changing x to x - alpha * J'(x). Also, the greater the gradient is in absolute value, the lower the value you get by changing x. Now, if you plot J, you can see that J'(x) is the slope of the tangent line to J(x) at the point x, and x - alpha * J'(x) shifts x toward a minimum. In other words, the gradient at x defines a one-dimensional vector, with both a direction and a magnitude, that tells you how to move x toward a minimum.
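A quick numeric check of that argument, using an assumed J(x) = x^2 and a small alpha:

```python
# Check that x - alpha * J'(x) lowers J for a small alpha (J(x) = x^2 assumed).
def J(x):  return x * x
def dJ(x): return 2.0 * x

alpha = 0.1
for x in (3.0, -2.0):             # one point where J is increasing, one where it is decreasing
    x_new = x - alpha * dJ(x)
    print(J(x_new) <= J(x))       # True in both cases: the step reduces J
```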

Now consider the case where theta has n dimensions. Then the gradients of J(theta) with respect to each theta_i together form an n-dimensional vector that tells you, by direction and magnitude, how to move theta toward a minimum.
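And a small sketch of the n-dimensional case (here n = 2), stepping against the gradient of an assumed J(theta) = theta_1^2 + 2*theta_2^2:

```python
# Gradient descent in n dimensions (n = 2) on J(theta) = t1^2 + 2*t2^2 (assumed).
def grad_J(theta):
    t1, t2 = theta
    return [2.0 * t1, 4.0 * t2]   # vector of partial derivatives

theta, alpha = [3.0, -2.0], 0.1
for _ in range(200):
    g = grad_J(theta)
    theta = [t - alpha * gi for t, gi in zip(theta, g)]

print(theta)  # both components approach 0, the minimum of J
```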

answered Oct 28 '22 by Corei13