 

Derivative of activation function and use in backpropagation [closed]

I am reading this document, and they stated that the weight adjustment formula is this:

new weight = old weight + learning rate * delta * df(e)/de * input

The df(e)/de part is the derivative of the activation function, which is typically a sigmoid-shaped function such as the logistic function or tanh.

  • What is this actually for?
  • Why are we even multiplying with that?
  • Why isn't just learning rate * delta * input enough?

This question came after this one and is closely related to it: Why must a nonlinear activation function be used in a backpropagation neural network?.

asked Mar 20 '12 by corazza

People also ask

What is a derivative of the activation function used for in backpropagation?

We see that the derivative of the activation function is important for getting the gradients, and so for the learning of the neural network. A constant derivative would not help gradient descent, and we would not be able to learn the optimal parameters.

What is the derivative of activation function?

Derivatives are fundamental to the optimization of neural networks. Activation functions allow for non-linearity in an inherently linear model (y = wx + b), which is nothing but a sequence of linear operations.

Does backpropagation adjust the activation function?

In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model's parameters (weights and biases).

Which rules is used in backpropagation for differentiation?

The chain rule allows us to find the derivative of composite functions. It is computed extensively by the backpropagation algorithm, in order to train feedforward neural networks.


2 Answers

Training a neural network just means finding values for every cell in the weight matrices (of which there are two for an NN with one hidden layer) such that the squared differences between observed and predicted data are minimized. In practice, the individual weights comprising the two weight matrices are adjusted with each iteration (their initial values are often set to random values). This is called online learning, as opposed to batch learning, where weights are adjusted only after a full pass over the training data.

But how should the weights be adjusted--i.e., which direction +/-? And by how much?

That's where the derivative comes in. A large value for the derivative results in a large adjustment to the corresponding weight. This makes sense because a large derivative means you are far from a minimum. Put another way, weights are adjusted at each iteration in the direction of steepest descent (the negative of the gradient) on the surface of the cost function defined by the total error (observed versus predicted).

After the error on each pattern is computed (by subtracting the actual value of the response variable, or output vector, from the value predicted by the NN during that iteration), each weight in the weight matrices is adjusted in proportion to the calculated error gradient.

Because the error calculation begins at the end of the NN (i.e., at the output layer by subtracting observed from predicted) and proceeds to the front, it is called backprop.
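The per-pattern (online) update described above can be sketched in code. This is a minimal, illustrative example, not from the original answer: one hidden layer, tanh activations, and weight updates of the cited form lr * delta * df(e)/de * input, with the error signal starting at the output layer and propagating toward the front.

```python
import numpy as np

rng = np.random.default_rng(0)

def tanh_deriv(a):
    # derivative of tanh written in terms of its output: 1 - tanh(e)^2
    return 1.0 - a ** 2

# toy training set: the XOR mapping
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
lr = 0.1

def predict(X):
    return np.tanh(np.tanh(X @ W1) @ W2)

initial_err = np.mean((predict(X) - Y) ** 2)

for epoch in range(5000):
    for x, y in zip(X, Y):
        # forward pass
        h = np.tanh(x @ W1)
        out = np.tanh(h @ W2)
        # backward pass: the error signal starts at the output layer...
        delta_out = (y - out) * tanh_deriv(out)
        # ...and propagates toward the front via the chain rule
        delta_hid = (delta_out @ W2.T) * tanh_deriv(h)
        # each weight moves by lr * delta * df(e)/de * input
        W2 += lr * np.outer(h, delta_out)
        W1 += lr * np.outer(x, delta_hid)

final_err = np.mean((predict(X) - Y) ** 2)
```

After training, the squared error is substantially lower than at initialization, which is all "learning" means here.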


More generally, the derivative (or gradient, for multivariable problems) is used by the optimization technique (for backprop, usually plain gradient descent, though variants such as conjugate gradient are also used) to locate minima of the objective (a.k.a. loss) function.

It works this way:

A minimum of a curve is a point where the line tangent to the curve has a slope of 0, i.e., where the first derivative is 0.

So if you are walking around a 3D surface defined by the objective function and you walk to a point where the slope is 0, then you are at the bottom: you have found a minimum (whether global or local) of the function.

But the first derivative tells you more than that: it also tells you whether you are going in the right direction to reach the function's minimum.

It's easy to see why this is so if you think about what happens to the slope of the tangent line as the point on the curve/surface is moved down toward the function's minimum.

The slope (and hence the magnitude of the derivative at that point) gradually decreases. In other words, to minimize a function, follow the negative of the derivative; if the derivative's magnitude is shrinking, you are moving in the correct direction.
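The walk-downhill picture above can be shown in one dimension. This is a small illustration whose function is this sketch's own choice, not from the answer: minimize f(x) = (x - 3)^2 by repeatedly stepping against its derivative f'(x) = 2(x - 3).

```python
def f_prime(x):
    # derivative of f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x = 10.0      # start far from the minimum at x = 3
lr = 0.1
for _ in range(100):
    # a large |f'(x)| means a large step; as x nears the bottom,
    # the slope (and hence the step) shrinks toward 0
    x -= lr * f_prime(x)
```

After the loop, x sits essentially at 3, the point where the tangent slope is 0.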

answered Oct 01 '22 by doug


The weight update formula you cite isn't just some arbitrary expression. It comes about by assuming an error function and minimizing it with gradient descent. The derivative of the activation function is thus there because, essentially, of the chain rule of calculus.
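That chain-rule origin can be checked numerically. In this sketch (all names are illustrative assumptions): for a single neuron out = f(w·x) with squared error E = 0.5·(target − out)², the chain rule gives dE/dw = −(target − out)·f'(w·x)·x, so the gradient-descent step w −= lr·dE/dw is exactly new weight = old weight + lr · delta · df(e)/de · input, with delta = target − out.

```python
import math

def f(e):
    # logistic sigmoid activation
    return 1.0 / (1.0 + math.exp(-e))

def f_prime(e):
    # the sigmoid's derivative in terms of its output: s * (1 - s)
    s = f(e)
    return s * (1.0 - s)

w, x, target = 0.7, 1.3, 1.0
e = w * x
delta = target - f(e)

# the cited update direction: delta * df(e)/de * input
analytic = delta * f_prime(e) * x

# central finite difference of -dE/dw, which should match it
def error(w_):
    return 0.5 * (target - f(w_ * x)) ** 2

eps = 1e-6
numeric = -(error(w + eps) - error(w - eps)) / (2 * eps)
```

The two quantities agree to within finite-difference error, confirming that the df(e)/de factor is exactly the chain-rule term from differentiating through the activation.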

Books on neural networks are more likely than that document to include the full derivation of the backpropagation update rule. See, for example, Introduction to the Theory of Neural Computation by Hertz, Krogh, and Palmer.

answered Oct 01 '22 by Michael J. Barber