I am following along (the code is a mess, I'm just messing around) with Introduction to the Math of Neural Networks, using this simple 3-layer neural net:
My calculations come out pretty much the same as the book's (I attribute the differences to rounding):
o1 delta: 0.04518482993361776
h1 delta: -0.0023181625149143255
h2 delta: 0.005031782661407674
h1 -> o1: 0.01674174257328656
h2 -> o1: 0.033471787838638474
b2 -> o1: 0.04518482993361776
// didn't calculate layer 1 gradients but would use the same approach
But what exactly are the gradients? Are they each node's contribution to the error of o1?
The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, allowing us to predict the effect of taking a small step from a point in any direction.
A node, also called a neuron or perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection. Nodes are then organized into layers to form a network.
The gradient is a vector that gives us the direction in which the loss function has its steepest ascent. The direction of steepest descent is exactly opposite to the gradient, which is why we subtract the gradient vector from the weight vector.
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. In machine learning, gradient descent is used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
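To make these definitions concrete, here is a small Python sketch (the function f(x, y) = x² + 3y² and the step sizes are my own illustrative choices, not from the book) that approximates the gradient with finite differences and confirms that a small step against it decreases the function:

```python
# Minimal sketch: approximate the gradient of f(x, y) = x^2 + 3*y^2
# with central finite differences, then take a small step against it.
# f and all constants here are illustrative choices.

def f(x, y):
    return x**2 + 3 * y**2

def numerical_gradient(g, x, y, h=1e-6):
    # Each component of the gradient is a partial derivative:
    # hold the other variable fixed and measure the local slope.
    df_dx = (g(x + h, y) - g(x - h, y)) / (2 * h)
    df_dy = (g(x, y + h) - g(x, y - h)) / (2 * h)
    return df_dx, df_dy

x, y = 1.0, 2.0
gx, gy = numerical_gradient(f, x, y)
print(gx, gy)                             # ~2.0, ~12.0 (true gradient is (2x, 6y))

alpha = 0.05                              # small step against the gradient...
print(f(x, y))                            # 13.0
print(f(x - alpha * gx, y - alpha * gy))  # ...gives a smaller value (~6.69)
```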
Let me first explain gradient descent. Gradient descent is an optimization algorithm for minimizing a cost function.
Consider the following example, where f(t) is the function we want to minimize: t has some initial value t1, and we want to find a value of t at which f(t) attains its minimal value.
This is the formula for the gradient descent algorithm:
t = t - α d/dt f(t),
where α is the learning rate and d/dt f(t) is the derivative of the function. A derivative is simply the slope of the line that is tangent to the function at a given point.
We keep applying this formula until we reach the minimum.
Gradient descent will update t in the following fashion: if the slope (derivative) at the current t is positive, then, since α is also positive, the value of t will decrease, hence reducing f(t). We repeat this until d/dt f(t) == 0 (the slope of any differentiable function at its minima and maxima is zero).
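Here is a minimal sketch of that loop in Python; the function f(t) = (t − 3)², its derivative, the starting value, and the learning rate are all my own illustrative choices:

```python
# Minimal 1-D gradient descent, assuming f(t) = (t - 3)^2,
# whose derivative is f'(t) = 2*(t - 3) and whose minimum is at t = 3.
def df(t):
    return 2 * (t - 3)

t = 0.0        # initial value t1
alpha = 0.1    # learning rate

for _ in range(100):
    t = t - alpha * df(t)   # t = t - α * d/dt f(t)

print(t)  # ~3.0: the slope is now ~0, so the updates have converged
```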
We can now apply the idea of gradient descent in our backpropagation algorithm in order to adjust our weights properly.
Given a training example e, we define the error function as
E_e(\vec{w}) = \frac{1}{2} \sum_{k \in \text{Outputs}} (d_k - o_k)^2,
where k ranges over the output nodes of the network, d_k is the desired output, and o_k is the observed output. For example, with d = (1, 0) and o = (0.8, 0.2), the error is (1/2)((0.2)^2 + (-0.2)^2) = 0.04.
Observation: if this function E equals 0, that means that for all k, d_k == o_k, i.e. the output of the neural network was the same as the desired one and there is no work to be done; our NN is very smart. Initially, of course, the weights are assigned randomly and this is never the case, but that is what we want to achieve (or nearly achieve, hopefully).
Since we now have an error function that we want to minimize, does it click now what we can apply? Gradient descent, yes! ^^ The idea is to modify the weights according to the negative of the gradient of the error function, to get a fast reduction of the error on this example e, so we revise the weights according to the gradient information as
\Delta w_{ji} = \alpha \left( -\frac{\partial E_e}{\partial w_{ji}} \right)
If you compare this with the example above, it does exactly the same thing; the only difference is that in this case the error function is a multivariate function, i.e. for a given weight we find the partial derivative with respect to that particular weight.
Applying this error correction (gradient descent) to the weights over and over again, we reach a point where the error function is minimized and our NN is well trained.
*Note: there are many issues involved in this process; for example, if the learning rate is too large, gradient descent can even overshoot the minimum. But to avoid confusion, don't worry about this too much for now. :)
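To tie the update formula to code, below is a minimal sketch of one example's training loop, assuming a single linear output unit o = Σ_j w_j x_j (simpler than a full multilayer network, but using the exact Δw_j = −α ∂E_e/∂w_j rule). The inputs, desired output, and learning rate are made up for illustration:

```python
# Sketch of the weight update Δw_j = -α * ∂E_e/∂w_j for one training
# example, assuming a single linear output o = Σ_j w_j * x_j.
import random

x = [0.5, -1.0, 0.25]            # inputs (illustrative values)
d = 1.0                          # desired output
w = [random.uniform(-1, 1) for _ in x]
alpha = 0.1                      # learning rate

for _ in range(200):
    o = sum(wj * xj for wj, xj in zip(w, x))   # observed output
    E = 0.5 * (d - o) ** 2                     # error on this example
    # ∂E/∂w_j = -(d - o) * x_j, so -α * ∂E/∂w_j = α * (d - o) * x_j
    w = [wj + alpha * (d - o) * xj for wj, xj in zip(w, x)]

print(E)  # ~0: the output now matches the desired value
```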
I have not read the book, but it sounds like you need to read a chapter on the gradient descent algorithm. I have found this course (https://class.coursera.org/ml-006/lecture) to be a very good introduction; it starts with a very intuitive presentation of linear regression.
The direct answer to your question is: gradients are the partial derivatives of the error function with respect to the node weights. Gradient descent tries to find a combination of weights that minimizes some error function (usually mean squared error). The way you find this combination is to compute the derivative of the function and update the weights in the direction opposite to the derivative, using a small multiplier also known as the learning rate. For a nested function like a neural network, a hidden layer's derivative can be obtained via the chain rule.
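As a small illustration of that chain rule, here is a sketch with one input, one sigmoid hidden node, and one sigmoid output node; all weights and values are made up, not the numbers from your book:

```python
# Sketch of the chain rule for hidden-layer gradients, assuming
# sigmoid activations and squared error E = 1/2 * (d - o)^2.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward pass: one hidden node h feeding one output node o.
x = 0.35                  # input (made-up value)
w_xh, w_ho = 0.4, 0.6     # hypothetical weights
h = sigmoid(w_xh * x)
o = sigmoid(w_ho * h)
d = 1.0                   # desired output

# Output delta: ∂E/∂net_o = -(d - o) * o * (1 - o)
delta_o = -(d - o) * o * (1 - o)

# Chain rule pushes the delta back: ∂E/∂net_h = delta_o * w_ho * h * (1 - h)
delta_h = delta_o * w_ho * h * (1 - h)

# The gradient for each weight is its delta times the input to that weight:
grad_ho = delta_o * h     # ∂E/∂w_ho
grad_xh = delta_h * x     # ∂E/∂w_xh
print(grad_ho, grad_xh)
```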
I would suggest trying to completely understand the simplest case first: linear regression with one variable, where you can still plot your error function and see what it looks like. After that, understanding the neural network case will follow naturally. This is covered in the Coursera ML course along with the programming exercises.
Consider the cost function of a neural network, J(theta), where theta = (theta_1, theta_2, ..., theta_n) are the weights of the connections in the network. Our goal is to minimize the function J(theta) w.r.t. (with respect to) theta. Notice that J(theta) is a multivariate continuous function; mathematically, the gradient component for theta_i is simply the partial derivative of J w.r.t. theta_i. Now let's try to find out the physical meaning of gradients.
For demonstration, consider the case where theta consists of only one variable, x; that is, theta = (x). Then the gradient of J at x is simply J'(x), the derivative of J at the point x. Now, for sufficiently small alpha,
[J(x) is increasing at x]
==> [J'(x) >= 0]
==> [x - alpha * J'(x) <= x]
==> [J(x - alpha * J'(x)) <= J(x)]
Similarly,
[J(x) is decreasing at x]
==> [J'(x) <= 0]
==> [x - alpha * J'(x) >= x]
==> [J(x - alpha * J'(x)) <= J(x)]
So for small alpha we always get a lower value of J by changing x to x - alpha * J'(x); moreover, the greater the gradient is in absolute value, the lower the value you get. If you plot J, you can see that J'(x) is the slope of the tangent line to J(x) at the point x, and that x - alpha * J'(x) shifts x toward a minimum. In other words, the gradient at x denotes a one-dimensional vector, with both direction and magnitude, that points x toward a minimum.
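A quick numeric check of this claim, assuming J(x) = x² (so J'(x) = 2x); the sample points and step sizes are arbitrary choices:

```python
# Check that J(x - alpha * J'(x)) <= J(x) for small alpha,
# assuming J(x) = x^2; both J and the sampled values are illustrative.
def J(x):
    return x ** 2

def dJ(x):
    return 2 * x

for x in (-3.0, 0.5, 2.0):
    for alpha in (0.01, 0.1, 0.4):
        assert J(x - alpha * dJ(x)) <= J(x)
print("holds for the sampled points and step sizes")
```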
Now consider the case where theta has n dimensions. Then the gradient of J(theta), whose components are the partial derivatives w.r.t. each theta_i, is an n-dimensional vector that points theta toward a minimum.