How calculating hessian works for Neural Network learning

1 Answer

To understand the Hessian, you first need to understand the Jacobian, and to understand the Jacobian you need to understand the derivative.

The derivative measures how fast a function's value changes as its argument changes. So if you have the function f(x)=x^2, you can compute its derivative, f'(x)=2x, and learn how much f(x+t) differs from f(x) for a small enough t. This tells you about the basic dynamics of the function.
The gradient of a multidimensional function shows you the direction of the biggest increase in value (the direction in which the directional derivative is largest). So given a function, e.g. g(x,y)=-x+y^2, whose gradient is (-1, 2y), you know that to increase g you should decrease x and move y away from zero, and that the pull on y grows stronger the larger |y| already is. This is the basis of gradient-based methods such as steepest descent (used in traditional backpropagation), which step against the gradient in order to minimize.
The Jacobian is yet another generalization, as your function might have many output values, like g(x,y)=(x+1, x*y, x-y). You now have one gradient per output value (one for each of the 3 outputs), each holding 2 partial derivatives, and together they form a matrix of 3*2=6 values.
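The three ideas above can be checked numerically. What follows is only a minimal sketch using central finite differences and the standard library; the helper names `derivative`, `gradient` and `jacobian` are made up for illustration, and the vector-valued example is taken to be (x+1, x*y, x-y) so that it depends only on x and y.

```python
def derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x) for a scalar function of one variable."""
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient(f, point, h=1e-5):
    """Estimate the gradient: one partial derivative per input variable."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad


def jacobian(f, point, h=1e-5):
    """Estimate the Jacobian: one row (a gradient) per output component."""
    n_outputs = len(f(*point))
    return [gradient(lambda *p, k=k: f(*p)[k], point, h) for k in range(n_outputs)]


f = lambda x: x ** 2                      # scalar input, scalar output
g = lambda x, y: -x + y ** 2              # two inputs, scalar output
vec = lambda x, y: (x + 1, x * y, x - y)  # two inputs, three outputs (assumed form)

print(derivative(f, 3.0))         # close to 6
print(gradient(g, [1.0, 2.0]))    # close to [-1, 4]
print(jacobian(vec, [1.0, 2.0]))  # 3 rows of 2 partial derivatives each
```

In real neural network code you would normally rely on automatic differentiation rather than finite differences; the sketch is only meant to make the shapes concrete (a number, a vector, a matrix).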

Now, the derivative shows you the dynamics of the function itself. But you can go one step further: if you can use these dynamics to find the optimum of the function, maybe you can do even better if you find out the dynamics of the dynamics, that is, compute derivatives of second order? This is exactly what the Hessian is: a matrix of the second-order derivatives of your function. It captures the dynamics of the derivatives, so how fast (and in what direction) the change itself changes. It may seem a bit complex at first sight, but if you think about it for a while it becomes quite clear. You want to go in the direction of the gradient, but you do not know "how far" (what the correct step size is). And so you define a new, smaller optimization problem, where you ask "OK, I have this gradient, how can I tell where to go?" and solve it analogously, using derivatives (and the derivatives of the derivatives form the Hessian).
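To make "derivatives of the derivatives" concrete, here is a rough sketch that estimates the Hessian by numerically differentiating the gradient, and then uses it to pick both the direction and the length of a step (a Newton step). The test function `bowl` and the helper names are assumptions chosen for illustration, not anything from a particular library, and the 2x2 solve via Cramer's rule only works when the Hessian is invertible.

```python
def gradient(f, point, h=1e-3):
    """Central-difference estimate of the gradient of a scalar function."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad


def hessian(f, point, h=1e-3):
    """Estimate the Hessian: entry [i][j] is d/dx_i of (df/dx_j)."""
    n = len(point)
    rows = []
    for i in range(n):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        g_up = gradient(f, up, h)
        g_down = gradient(f, down, h)
        rows.append([(g_up[j] - g_down[j]) / (2 * h) for j in range(n)])
    return rows


def newton_step_2d(f, point, h=1e-3):
    """Solve H * step = -gradient for a two-variable function (Cramer's rule)."""
    g = gradient(f, point, h)
    (a, b), (c, d) = hessian(f, point, h)
    det = a * d - b * c  # must be non-zero for the step to exist
    step_x = (-g[0] * d + b * g[1]) / det
    step_y = (-a * g[1] + c * g[0]) / det
    return [point[0] + step_x, point[1] + step_y]


# Hypothetical example function with its minimum at (0, 0).
bowl = lambda x, y: x ** 2 + x * y + 3 * y ** 2

print(hessian(bowl, [2.0, 1.0]))         # close to [[2, 1], [1, 6]]
print(newton_step_2d(bowl, [2.0, 1.0]))  # close to [0, 0]
```

Because `bowl` is exactly quadratic, a single step lands (up to rounding) on the minimum. For a neural network loss the Hessian has one row and column per weight, which is why computing and inverting it directly quickly becomes expensive.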

You may also look at this geometrically: gradient-based optimization approximates your function with a line. You simply try to find the line which is closest to your function at the current point, and it defines the direction of change. Now, lines are quite primitive, so maybe we could use some more complex shapes like... parabolas? Second-derivative (Hessian) methods simply try to fit a parabola (a quadratic function, f(x)=ax^2+bx+c) to the function at your current position, and then choose a suitable step based on this approximation.
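The line-versus-parabola picture can be sketched in one dimension. The example function exp(x) - 2x below is an assumption picked only because its derivatives are easy to write down and its minimum is known to sit at x = ln 2; the helper names are likewise hypothetical.

```python
import math


def f(x):
    return math.exp(x) - 2 * x


def df(x):
    return math.exp(x) - 2   # first derivative


def d2f(x):
    return math.exp(x)       # second derivative (always positive here)


def linear_model(x0, t):
    """The line fitted at x0: what a gradient method 'sees'."""
    return f(x0) + df(x0) * t


def quadratic_model(x0, t):
    """The parabola fitted at x0: what a second-order method 'sees'."""
    return f(x0) + df(x0) * t + 0.5 * d2f(x0) * t ** 2


def parabola_step(x0):
    """Displacement t that minimizes the fitted parabola (needs d2f(x0) > 0)."""
    return -df(x0) / d2f(x0)


# The parabola tracks the true function more closely than the line does.
print(abs(f(0.1) - linear_model(0.0, 0.1)))
print(abs(f(0.1) - quadratic_model(0.0, 0.1)))

# Repeatedly jumping to the minimum of the fitted parabola.
x = 0.0
for _ in range(6):
    x = x + parabola_step(x)
print(x, math.log(2))
```

Note that the line has no minimum at all, so it gives a direction but no step length, while the parabola has a minimum you can jump to. That jump is only sensible where the second derivative is positive.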

Fun fact: adding a momentum term to your gradient-based optimization approximates (under certain conditions) Hessian-based optimization, and is far less computationally expensive.
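As a small illustration of the practical effect (not a demonstration of the approximation claim itself), the sketch below compares plain gradient descent with a momentum variant on a badly scaled quadratic, where the curvature differs by a factor of 100 between the two axes. The loss function, learning rate and momentum coefficient are all assumptions chosen for the example.

```python
def loss(x, y):
    # Curvature is 1 along x and 100 along y, so one learning rate fits both poorly.
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def grad(x, y):
    return (x, 100.0 * y)


def run(lr, beta, steps=100, start=(1.0, 1.0)):
    """Gradient descent with a momentum term; beta=0 gives plain gradient descent."""
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx - lr * gx  # velocity remembers previous steps
        vy = beta * vy - lr * gy
        x += vx
        y += vy
    return loss(x, y)


plain_loss = run(lr=0.01, beta=0.0)
momentum_loss = run(lr=0.01, beta=0.9)
print(plain_loss, momentum_loss)
```

With these settings the momentum run ends at a noticeably lower loss after the same number of gradient evaluations, without ever forming a Hessian. How well this carries over to a real network depends on the loss surface and the hyperparameters.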

answered Sep 21 '22 05:09

lejlot

Tags:

artificial-intelligence

neural-network

backpropagation

hessian-matrix

Iulian Rosca
