I'm trying to find the minimum of a function of N parameters using gradient descent, while constraining the sum of the absolute values of the parameters to be 1 (or <= 1; it doesn't matter which). For this I'm using the method of Lagrange multipliers, so if my function is f(x), I will be minimizing f(x) + lambda * (g(x) - 1), where g(x) is a smooth approximation of the sum of the absolute values of the parameters.
Now, as I understand it, the gradient of this function is only 0 when g(x) = 1, so a method that finds local minima should find a minimum of my function at which my condition is also satisfied. The problem is that this added term makes my function unbounded, so gradient descent simply finds larger and larger lambdas with larger and larger parameters (in absolute value) and never converges.
At the moment I'm using Python's (SciPy's) implementation of CG, so I would really prefer suggestions that do not require me to rewrite or tweak the CG code myself but instead use an existing method.
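For concreteness, the setup described above might look like the following sketch; the objective f, the smoothing constant, the problem size, and the starting point are all illustrative assumptions, not anything given in the question:

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-8  # smoothing constant for the |x| approximation (an assumption)

def f(x):
    # Hypothetical placeholder objective; substitute the real function here.
    return np.sum((x - 0.3) ** 2)

def g(x):
    # Smooth approximation of sum(|x_i|): sqrt(x_i^2 + eps) ~ |x_i|.
    return np.sum(np.sqrt(x ** 2 + EPS))

def lagrangian(z):
    # z packs the N parameters together with the multiplier lambda.
    x, lam = z[:-1], z[-1]
    return f(x) + lam * (g(x) - 1.0)

z0 = np.append(np.full(5, 0.2), 1.0)  # initial x (N=5) and lambda
# As described above, CG on this objective typically fails to converge.
res = minimize(lagrangian, z0, method='CG')
```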
We want to optimize (i.e. find the minimum and maximum values of) a function f(x,y,z), subject to the constraint g(x,y,z) = k. Again, the constraint may be the equation that describes the boundary of a region, or it may not be.
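Concretely, the method says to solve the system ∇f = λ∇g together with g(x,y,z) = k; the simultaneous solutions (x, y, z, λ) are the candidate points for the constrained minimum and maximum.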
A gradient is just a vector that collects all of a function's first partial derivatives in one place; each element of the gradient is one of those partial derivatives. An easy way to think of the gradient is that, at any point we pick, it gives the direction of steepest ascent of the function.
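As a quick illustration (the function, point, and step size below are arbitrary choices for the example), the partial derivatives can be collected numerically with central differences:

```python
import numpy as np

def numerical_gradient(func, x, h=1e-6):
    # One central-difference partial derivative per coordinate.
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad

# Example: f(x, y) = x^2 + 3y has gradient (2x, 3).
fxy = lambda p: p[0] ** 2 + 3 * p[1]
print(numerical_gradient(fxy, np.array([1.0, 2.0])))  # approximately [2. 3.]
```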
The problem is that when using Lagrange multipliers, the critical points don't occur at local minima of the Lagrangian - they occur at saddle points instead. Since the gradient descent algorithm is designed to find local minima, it fails to converge when you give it a problem with constraints.
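For example, minimizing f(x) = x subject to g(x) = x^2 = 1 gives the Lagrangian L(x, lambda) = x + lambda * (x^2 - 1), whose critical points are (x, lambda) = (1, -1/2) and (-1, 1/2); at both points the Hessian of L is indefinite, so they are saddle points, not minima.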
There are typically three solutions:

1. Use a numerical method that can find saddle points, e.g. Newton's method. These typically require analytic expressions for both the gradient and the Hessian, however.
2. Use a penalty method: add a term to the cost function that is zero when the constraint is satisfied and grows large when it is violated, then run an ordinary unconstrained minimizer (a sketch follows this list). The drawback is that the penalty weight has to be tuned, and convergence can be slow.
3. Instead of looking for critical points of the Lagrangian, minimize the squared norm of the gradient of the Lagrangian. At every critical point this quantity is zero, and since a square can never be negative, those critical points become ordinary minima, which gradient-based methods can find.
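For reference, a minimal sketch of the second (penalty) approach, assuming a placeholder objective f and the smooth g from the question; the penalty weight MU is an illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

MU = 100.0  # penalty weight (an assumption; often increased over several runs)

def f(x):
    # Hypothetical placeholder objective; substitute the real function here.
    return np.sum((x - 0.3) ** 2)

def g(x):
    # Smooth approximation of sum(|x_i|), as in the question.
    return np.sum(np.sqrt(x ** 2 + 1e-8))

def penalized(x):
    # Quadratic penalty: zero on the constraint surface, large away from it.
    return f(x) + MU * (g(x) - 1.0) ** 2

res = minimize(penalized, np.full(5, 0.2), method='CG')
```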
Personally, I would go with the third approach, and find the gradient of the square of the gradient of the Lagrangian numerically if it's too difficult to get an analytic expression for it.
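A minimal sketch of that third approach, under the same illustrative assumptions as above (placeholder f, smoothed g, arbitrary starting point and step size). The objective is the squared norm of the Lagrangian's gradient, computed by central differences; since no analytic jac is supplied, scipy.optimize.minimize then estimates the gradient of this objective numerically as well, which matches the "find it numerically" suggestion:

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-8  # smoothing constant for the |x| approximation (an assumption)

def f(x):
    # Hypothetical placeholder objective; substitute the real function here.
    return np.sum((x - 0.3) ** 2)

def g(x):
    # Smooth approximation of sum(|x_i|).
    return np.sum(np.sqrt(x ** 2 + EPS))

def lagrangian(z):
    x, lam = z[:-1], z[-1]
    return f(x) + lam * (g(x) - 1.0)

def grad_lagrangian(z, h=1e-6):
    # Central-difference gradient of the Lagrangian w.r.t. (x, lambda).
    grad = np.zeros_like(z)
    for i in range(len(z)):
        step = np.zeros_like(z)
        step[i] = h
        grad[i] = (lagrangian(z + step) - lagrangian(z - step)) / (2 * h)
    return grad

def objective(z):
    # Squared norm of the Lagrangian's gradient: zero exactly at the
    # critical (saddle) points and never negative, so those saddle points
    # become ordinary minima that CG can find.
    return np.sum(grad_lagrangian(z) ** 2)

z0 = np.append(np.full(5, 0.2), 1.0)  # initial x (N=5) and lambda
res = minimize(objective, z0, method='CG')
print(res.x[:-1], res.x[-1])  # recovered parameters and multiplier
```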
Also, you don't quite make it clear in your question - are you using gradient descent, or CG (conjugate gradients)?