SciPy optimisation: Newton-CG vs BFGS vs L-BFGS

Q: What is Newton-CG?

Newton-CG applies the (linear) conjugate gradient (CG) method to the second-order Taylor-series approximation of f around the current iterate x_k. Each CG iteration requires one Hessian-vector product of the form \nabla^2 f(x_k) v.
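A minimal sketch of what this looks like in SciPy, assuming the Rosenbrock test function shipped with `scipy.optimize` as a stand-in objective (it is not the energy function from the question below). The `hessp` argument of `minimize` supplies the Hessian-vector product; `fd_hessp` is a hypothetical helper that approximates it from gradients alone. When neither `hess` nor `hessp` is given, SciPy documents that Newton-CG approximates the product by finite differences of `jac`.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

def fd_hessp(x, v, eps=1e-6):
    """Hypothetical helper: approximate H(x) @ v by a forward difference of the gradient."""
    return (rosen_der(x + eps * v) - rosen_der(x)) / eps

# Starting point chosen for illustration only
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Exact Hessian-vector product; the callable has signature hessp(x, p)
res_exact = minimize(rosen, x0, method='Newton-CG', jac=rosen_der, hessp=rosen_hess_prod)

# Finite-difference Hessian-vector product built from gradient calls only
res_fd = minimize(rosen, x0, method='Newton-CG', jac=rosen_der, hessp=fd_hessp)

print("exact hessp:", res_exact.x, "iterations:", res_exact.nit)
print("fd hessp:   ", res_fd.x, "iterations:", res_fd.nit)
```

Each finite-difference product costs one extra gradient call, which would be consistent with a run that reports many gradient evaluations and zero Hessian evaluations.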

Q: Is BFGS gradient-based?

Yes. BFGS is a gradient-based algorithm for solving unconstrained nonlinear optimization problems. It is an iterative local-search (hill-climbing) technique that seeks a stationary point using an approximation of the problem's Hessian matrix.

Q: What is the L-BFGS-B algorithm?

L-BFGS-B is a limited-memory quasi-Newton code for bound-constrained optimization, i.e. for problems where the only constraints are of the form l <= x <= u. The current release is version 3.0. The distribution file was last changed on 02/08/11.
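As an illustration of the bound constraints l <= x <= u, here is a small sketch using the `bounds` argument of `scipy.optimize.minimize`. The quadratic objective `f`, its gradient `grad_f`, and the box [0, 1] x [0, 1] are assumptions made up for this example.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Quadratic whose unconstrained minimum sits at (2, -1), outside the box
    return (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)])

# One (lower, upper) pair per variable
bounds = [(0.0, 1.0), (0.0, 1.0)]

res = minimize(f, x0=np.array([0.5, 0.5]), method='L-BFGS-B', jac=grad_f, bounds=bounds)

# The constrained minimiser should lie on the boundary of the box
print(res.x, res.fun)
```

If `bounds` is omitted, the same method runs as an unconstrained limited-memory quasi-Newton solver.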

Tags: python, optimization, scipy, newtons-method

I am doing an optimisation problem using Scipy, where I am taking a flat network of vertices and bonds of size NNxNN, connecting two sides of it (i.e., making it periodic), and minimising an energy function, so that it curls up to form a cylinder. (See the links below.)

Since I have the function energy(xyz-position) and its gradient, I decided to use the three methods recommended in the Scipy manual (Newton-CG, BFGS, L-BFGS-B) and compare how they performed.

I call the optimisation function as follows, and I merely replace 'Newton-CG' with 'BFGS' and 'L-BFGS-B' according to case:

```python
from scipy.optimize import minimize

# energy, energy_der and xyzInit are defined elsewhere in my code
res = minimize(energy, xyzInit, method='Newton-CG', jac=energy_der, options={'disp': True})
```
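A comparison of this kind can be sketched as a loop over the method names. The snippet below is only an assumption-laden stand-in: it uses SciPy's built-in Rosenbrock function and gradient in place of `energy` and `energy_der`, and a made-up 20-dimensional starting point in place of `xyzInit`.

```python
import time
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Hypothetical starting point; replace with the real initial configuration
x0 = np.full(20, 1.5)

results = {}
for method in ('Newton-CG', 'BFGS', 'L-BFGS-B'):
    start = time.perf_counter()
    res = minimize(rosen, x0, method=method, jac=rosen_der)
    elapsed = time.perf_counter() - start
    results[method] = res
    print(f"{method:10s} f={res.fun:.3e} nit={res.nit} "
          f"nfev={res.nfev} njev={res.njev} time={elapsed:.3f}s")
```

Keeping the `OptimizeResult` objects in a dictionary makes it easy to compare final function values, iteration counts and termination messages side by side.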

I found the following general behaviour (the output shown is for NN=9, which corresponds to a 3*9^2 = 243-dimensional parameter space):

BFGS systematically failed to find the correct minimum (for low NN), and failed to converge at all for large NN. See https://plot.ly/~apal90/162/ for end result.

```
 NN=9
 Method: BFGS
 Warning: Desired error not necessarily achieved due to precision loss.
 Current function value: 204.465912
 Iterations: 1239
 Function evaluations: 1520
 Gradient evaluations: 1508
 Time taken for minimisation: 340.728140116
```

Newton-CG found the correct minimum for small NN (<=8), but from NN=9 onwards it returned an incorrect minimum (viz., a cylinder squashed at one end), and for higher values it stopped converging altogether. Note: for some reason this behaviour was worse for odd values of NN. See https://plot.ly/~apal90/164/
```
 NN=9
 Method: Newton-CG
 Optimization terminated successfully.
 Current function value: 7.954412
 Iterations: 49
 Function evaluations: 58
 Gradient evaluations: 1654
 Hessian evaluations: 0
 Time taken for minimisation: 294.203114033
```
L-BFGS-B found the correct minimum, and did so very quickly, for every NN that I tested (up to NN=14). See https://plot.ly/~apal90/160/
```
 NN=9
 Method: L-BFGS-B
 Time taken for minimisation: 36.3749790192
```

Question: Why is L-BFGS-B superior to the other two methods in this case? In particular, why is it so much better than BFGS, when both are supposed to be quasi-Newton methods that work (to my understanding) in exactly the same manner?

My thoughts on the situation: All three methods build a quadratic approximation at every point x. For this, they need a gradient and a Hessian. If the Hessian is not given, it must be calculated by the algorithm. In our case, where only the gradient is given explicitly, it is calculated numerically at every step. More specifically, what we require is the inverse of the Hessian, and this is a very expensive step, especially in higher dimensions. As I understand it, Newton-CG calculates this inverse Hessian explicitly, hence its longer running time. The quasi-Newton methods such as BFGS and L-BFGS calculate an approximation to the Hessian (i.e., the curvature) from the gradient, which is cheaper and is also supposedly a better estimate of the curvature about a point. Thus, for quadratic functions Newton-CG converges faster, whereas for non-quadratic functions the quasi-Newton methods converge better. L-BFGS is a lower-memory version of BFGS that stores far less data at every step than the full NxN matrix, hence it is faster than BFGS.

This explanation shows a divergence between Newton-CG and the quasi-Newton methods. What it does not explain is the inability of the algorithms to find the true minimum, and especially the disparity between BFGS and L-BFGS, which are both supposed to function in the same manner.

My general hunch on the long convergence times is that the system is non-quadratic (i.e. flat) about the minimum, and thus the algorithm oscillates about without converging. Beyond that, if BFGS and L-BFGS truly work in the same manner, I believe there must be some difference between the convergence tolerances of the Scipy implementations. Otherwise, BFGS and L-BFGS don't really work in the same manner, and the latter probably calculates the Hessian far more accurately.
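The tolerance hunch can be tested by setting the stopping criteria explicitly instead of relying on defaults. In SciPy the three methods stop on different quantities: BFGS on the gradient norm (`gtol`), L-BFGS-B on either the relative decrease of the function (`ftol`) or the projected gradient (`gtol`), and Newton-CG on the step size (`xtol`). The sketch below again assumes the built-in Rosenbrock function as a stand-in objective, and the tolerance values are arbitrary choices for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.full(10, 1.5)  # hypothetical starting point

# Stopping criteria tightened per method; values are illustrative only
method_options = {
    'BFGS': {'gtol': 1e-8, 'maxiter': 10000},
    'L-BFGS-B': {'ftol': 1e-14, 'gtol': 1e-8, 'maxiter': 10000, 'maxfun': 100000},
    'Newton-CG': {'xtol': 1e-10, 'maxiter': 10000},
}

grad_norms = {}
for method, opts in method_options.items():
    res = minimize(rosen, x0, method=method, jac=rosen_der, options=opts)
    # Compare every method on the same yardstick: max-norm of the final gradient
    grad_norms[method] = np.linalg.norm(rosen_der(res.x), ord=np.inf)
    print(f"{method:10s} f={res.fun:.3e} |grad|_inf={grad_norms[method]:.3e} "
          f"nit={res.nit} msg={res.message}")
```

Reporting the final gradient norm for every method separates "stopped early because of a loose tolerance" from "converged to a different stationary point".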

References --

http://www.scipy-lectures.org/advanced/mathematical_optimization/#newton-and-quasi-newton-methods

https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization

https://en.wikipedia.org/wiki/Quasi-Newton_method

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-bfgs.html#optimize-minimize-bfgs

https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb


asked Feb 23 '17 19:02

ap21

1 Answer

Your question is missing two important pieces of information: the energy function and the initial guess. The energy function can be convex or non-convex, and smooth, piecewise-smooth or discontinuous. This makes it hard to fully answer your question in your context. However, I can explain some key differences between BFGS and L-BFGS-B.

Both methods are iterative methods for solving nonlinear optimization problems. They both approximate the Newton method by using an approximation of the Hessian of the function at every iteration. The key difference with the Newton method is that instead of computing the full Hessian at a specific point, they accumulate the gradients at previous points and use the BFGS formula to put them together as an approximation of the Hessian. Newton and BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum.
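To make the "BFGS formula" concrete, here is a sketch of the standard BFGS update of the inverse-Hessian approximation, written in terms of the step s = x_new - x_old and the gradient difference y = grad_new - grad_old. The function name `bfgs_inverse_update` and the toy quadratic are assumptions for illustration; this is not SciPy's internal code.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of an inverse-Hessian approximation H.

    s: step taken, y: change in gradient. Requires y @ s > 0 (curvature condition).
    """
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Toy quadratic f(x) = 0.5 * x^T A x, whose gradient is A @ x
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_old = np.array([1.0, 1.0])
x_new = np.array([0.5, 0.2])

s = x_new - x_old
y = A @ x_new - A @ x_old

H_new = bfgs_inverse_update(np.eye(2), s, y)

# The updated matrix satisfies the secant condition H_new @ y == s
print(H_new @ y, s)
```

The update only needs two gradients and two positions, which is why no explicit Hessian ever has to be evaluated.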

The original BFGS method accumulates all gradients since the given initial guess. There are two problems with this method. First, the memory can increase indefinitely (practical implementations avoid this by keeping a dense n-by-n matrix instead, which is itself expensive for large n). Second, for nonlinear problems, the Hessian at the initial guess is often not representative of the Hessian at the solution. The approximated Hessian will thus be biased until enough gradients are accumulated close to the solution. This can slow down convergence, but in my experience the method should still converge with a good line search algorithm for energy functions that have a single local minimum.

L-BFGS is the same as BFGS but with limited memory, which means that after some time, old gradients are discarded to leave more space for freshly computed gradients. This solves the memory problem, and it avoids the bias of the initial gradient. However, depending on the number of gradients kept in memory, the Hessian might never be precisely estimated, which can be another source of bias. This can also slow down convergence, but again, the method should still converge with a good line search algorithm for energy functions that have a single local minimum.
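The limited-memory idea can be sketched with the textbook two-loop recursion, which applies the implicit inverse-Hessian approximation to a vector using only the last m pairs (s, y). The name `lbfgs_apply`, the history length m = 3 and the diagonal test matrix are assumptions for illustration, not SciPy's implementation (there the history length is exposed as the `maxcor` option of L-BFGS-B).

```python
import numpy as np
from collections import deque

def lbfgs_apply(grad, s_hist, y_hist):
    """Two-loop recursion: return H @ grad using only the stored (s, y) pairs."""
    q = grad.copy()
    coeffs = []
    # First loop: newest pair to oldest pair
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        coeffs.append((alpha, rho))
    # Initial scaling taken from the most recent pair
    s_last, y_last = s_hist[-1], y_hist[-1]
    r = (s_last @ y_last) / (y_last @ y_last) * q
    # Second loop: oldest pair to newest pair
    for (s, y), (alpha, rho) in zip(zip(s_hist, y_hist), reversed(coeffs)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r

m = 3  # number of pairs kept; older pairs are dropped automatically
s_hist, y_hist = deque(maxlen=m), deque(maxlen=m)

rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 5.0, 10.0])  # Hessian of a toy quadratic
for _ in range(5):
    s = rng.standard_normal(4)
    s_hist.append(s)
    y_hist.append(A @ s)  # gradient change of the quadratic along step s

g = rng.standard_normal(4)
direction = -lbfgs_apply(g, s_hist, y_hist)
print(len(s_hist), direction)
```

Only m vectors of each kind are stored, so the cost per iteration grows linearly with the number of variables rather than quadratically.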

L-BFGS-B is the same as L-BFGS but with bound constraints on the input variables. L-BFGS-B will stop optimizing variables that are on the boundary of the domain. Since you did not specify any constraints, this aspect of the algorithm does not apply to your problem.

My hypothesis is that you are trying to solve a smooth but non-convex problem using an initial guess that is far from the solution, and that you end up in a local minimum. Since you mentioned that you start from a flat configuration, I would not be surprised if you start at a singularity that leads to a degenerate Hessian, which can cause trouble for the rest of the optimization. The only difference between BFGS and L-BFGS in your case is that every iteration will compute a slightly different gradient, and that the L-BFGS method happens to follow a path that leads to the global minimum.
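One way to probe this hypothesis is to perturb the initial guess slightly and see whether the result changes. The sketch below uses a made-up two-variable function with a stationary point at the origin and minima at x = ±1, as a toy analogue of a flat, degenerate start. The function, the noise scale 1e-2 and the seed are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def f(p):
    # Double well in x, simple bowl in y; stationary (but not minimal) at the origin
    x, y = p
    return (x ** 2 - 1.0) ** 2 + y ** 2

def grad_f(p):
    x, y = p
    return np.array([4.0 * x * (x ** 2 - 1.0), 2.0 * y])

# Degenerate start: the gradient is exactly zero here, so the solver sees nothing to do
x_flat = np.zeros(2)
res_flat = minimize(f, x_flat, method='L-BFGS-B', jac=grad_f)

# Same start with a small random perturbation to break the symmetry
rng = np.random.default_rng(0)
x_perturbed = x_flat + 1e-2 * rng.standard_normal(2)
res_perturbed = minimize(f, x_perturbed, method='L-BFGS-B', jac=grad_f)

print("flat start:     ", res_flat.x, res_flat.fun)
print("perturbed start:", res_perturbed.x, res_perturbed.fun)
```

Repeating the perturbed run with several seeds gives a rough picture of how sensitive the final configuration is to the starting point.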


answered Sep 18 '22 16:09

Gilles-Philippe Paillé
