I am reading a document that states the weight adjustment formula as:

new weight = old weight + learning rate * delta * df(e)/de * input

The `df(e)/de` part is the derivative of the activation function, which is usually a sigmoid function like tanh. Why is the derivative of the activation function needed at all? Why isn't `learning rate * delta * input` enough?

This question came after this one and is closely related to it: Why must a nonlinear activation function be used in a backpropagation neural network?.
We see that the derivative of the activation function is important for computing the gradients, and hence for the learning of the neural network. A constant derivative would provide no useful signal for gradient descent, and we would not be able to learn the optimal parameters.
Derivatives are fundamental to the optimization of neural networks. Activation functions introduce non-linearity into an otherwise linear model (y = wx + b), which is nothing but a sequence of linear operations.
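To see why the non-linearity matters, here is a minimal sketch (using NumPy, with made-up random weights): two stacked linear layers with no activation between them collapse into a single linear layer, so the extra depth adds nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer
W2 = rng.normal(size=(2, 4))   # second linear layer
x = rng.normal(size=3)

# Two linear layers applied in sequence, with no activation between them...
y = W2 @ (W1 @ x)

# ...are equivalent to one linear layer whose matrix is the product W2 @ W1.
W_combined = W2 @ W1
assert np.allclose(y, W_combined @ x)
```

A non-constant activation between the layers breaks this collapse, and its derivative is what carries the curvature information back through the network.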
In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model's parameters (weights and biases).
The chain rule allows us to find the derivative of composite functions. It is applied extensively by the backpropagation algorithm in order to train feedforward neural networks.
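As an illustration, here is a minimal sketch of the chain rule at work for a single sigmoid neuron with squared-error loss (the weight, input, and target are made-up values); the analytic gradient is checked against a numerical one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: y = sigmoid(w * x), squared-error loss L = (y - t)^2.
w, x, t = 0.5, 2.0, 1.0
y = sigmoid(w * x)

# Chain rule: dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = 2.0 * (y - t)
dy_dz = y * (1.0 - y)   # derivative of the sigmoid -- the df(e)/de term
dz_dw = x
grad = dL_dy * dy_dz * dz_dw

# Check against a central-difference numerical derivative.
eps = 1e-6
num = ((sigmoid((w + eps) * x) - t) ** 2
       - (sigmoid((w - eps) * x) - t) ** 2) / (2 * eps)
assert abs(grad - num) < 1e-6
```

The middle factor, `dy_dz`, is exactly the activation-function derivative that appeared in the question's update formula.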
Training a neural network simply means finding values for every cell in the weight matrices (of which there are two for a NN with one hidden layer) such that the squared differences between the observed and predicted data are minimized. In practice, the individual weights comprising the two weight matrices are adjusted at each iteration (their initial values are often set to random values). This is also called the online mode, as opposed to the batch mode, in which weights are adjusted only after the errors from many (or all) patterns have been accumulated.
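A minimal sketch of online (per-pattern) updates, assuming a single linear unit trained with the delta rule on a made-up noiseless dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))           # 8 patterns, 3 inputs each
t = X @ np.array([1.0, -2.0, 0.5])    # targets from a known linear map
w = np.zeros(3)                        # weights start at zero
lr = 0.05

# Online mode: the weights are adjusted after every single pattern.
for _ in range(500):                   # epochs over the training set
    for x_i, t_i in zip(X, t):
        err = t_i - w @ x_i            # per-pattern error
        w += lr * err * x_i            # per-pattern gradient step

# The weights recover the map that generated the targets.
assert np.allclose(w, [1.0, -2.0, 0.5], atol=1e-3)
```

In batch mode the inner loop would instead accumulate `err * x_i` over all patterns and apply one summed update per epoch.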
But how should the weights be adjusted--i.e., which direction +/-? And by how much?
That's where the derivative comes in. A large value for the derivative results in a large adjustment to the corresponding weight. This makes sense: if the derivative is large, you are far from a minimum. Put another way, weights are adjusted at each iteration in the direction of steepest descent (the direction in which the derivative is largest in magnitude) on the cost function's surface defined by the total error (observed versus predicted).
After the error on each pattern is computed (by subtracting the actual value of the response variable, or output vector, from the value predicted by the NN during that iteration), each weight in the weight matrices is adjusted in proportion to the calculated error gradient.
Because the error calculation begins at the end of the NN (i.e., at the output layer, by subtracting observed from predicted) and proceeds toward the front, it is called backprop.
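That back-to-front order can be sketched with a tiny 1-1-1 network (made-up weights, tanh hidden unit): the error signal is computed at the output first and then propagated back through each layer in turn.

```python
import math

# Tiny 1-1-1 network: h = tanh(w1*x), y = w2*h, loss L = 0.5*(y - t)^2
w1, w2 = 0.3, -0.7
x, t = 1.5, 0.2

# Forward pass
h = math.tanh(w1 * x)
y = w2 * h

# Backward pass: start at the output layer and move toward the input.
dL_dy = y - t                       # error at the output
dL_dw2 = dL_dy * h                  # gradient for the output weight
dL_dh = dL_dy * w2                  # error propagated back through w2
dL_dw1 = dL_dh * (1 - h**2) * x     # tanh'(z) = 1 - tanh(z)^2

# Numerical check on the hidden-layer gradient.
def loss(a, b):
    return 0.5 * (b * math.tanh(a * x) - t) ** 2

eps = 1e-6
num = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
assert abs(dL_dw1 - num) < 1e-6
```

Note that `dL_dw1` reuses `dL_dh`, which was computed at the later layer first; that reuse is the whole point of the backward pass.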
More generally, the derivative (or gradient for multivariable problems) is used by the optimization technique (for backprop, conjugate gradient is probably the most common) to locate minima of the objective (aka loss) function.
It works this way:
The first derivative is zero at a point on a curve where a line tangent to it has a slope of 0.
So if you are walking around a 3D surface defined by the objective function and you walk to a point where the slope = 0, then you are at the bottom--you have found a minimum (whether global or local) of the function.
But the first derivative tells you more than that. It also tells you whether you are going in the right direction to reach the function's minimum.
It's easy to see why this is so if you think about what happens to the slope of the tangent line as the point on the curve/surface moves down toward the function's minimum.
The slope (and hence the value of the derivative of the function at that point) gradually decreases. In other words, to minimize a function, follow the derivative--i.e., if its value is decreasing, then you are moving in the correct direction.
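The idea above can be sketched as plain gradient descent on a made-up one-dimensional objective: each step is proportional to the derivative, and the derivative shrinks as the minimum is approached.

```python
def f(x):        # objective function, minimum at x = 3
    return (x - 3.0) ** 2

def df(x):       # its first derivative
    return 2.0 * (x - 3.0)

x, lr = 0.0, 0.1
for _ in range(100):
    x -= lr * df(x)   # step downhill, proportional to the slope

assert abs(x - 3.0) < 1e-3    # we end up at the minimum...
assert abs(df(x)) < 1e-2      # ...where the slope has shrunk toward 0
```

Far from the minimum the steps are large; near it the derivative, and therefore the step size, shrinks toward zero.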
The weight update formula you cite isn't just some arbitrary expression. It comes about by assuming an error function and minimizing it with gradient descent. The derivative of the activation function is thus there because, essentially, of the chain rule of calculus.
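A sketch of that derivation, in the notation of the question (with $t$ the target, $y = f(e)$ the output, and $e = \sum_i w_i x_i$ the neuron's weighted input), assuming a squared-error loss:

$$E = \tfrac{1}{2}(t - y)^2, \qquad y = f(e), \qquad e = \sum_i w_i x_i.$$

By the chain rule,

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial e}\,\frac{\partial e}{\partial w_i} = -(t - y)\, f'(e)\, x_i.$$

Gradient descent steps against the gradient with learning rate $\eta$:

$$w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i} = w_i + \eta\, \delta\, f'(e)\, x_i, \qquad \delta = t - y,$$

which is exactly the cited formula, with $f'(e)$ being the $df(e)/de$ term. Dropping it would mean descending the gradient of some other function, not of the error.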
Books on neural networks are more likely to include the derivation of the update rule in backpropagation. See, for example, Introduction to the Theory of Neural Computation by Hertz, Krogh, and Palmer.