Precision of scipy.optimize.minimize

I am trying to calculate the minimal point of the function

f(x)=(x-2e-17)*(x-2e-17)

with scipy.optimize.minimize. The expected, precise result is x = 2e-17. But no matter how I fine-tune the tolerance parameters xtol and ftol of scipy.optimize.minimize, it still only gives the imprecise result 0 (see below). How can we get scipy to return the precise answer? Thanks.

In [35]: scipy.optimize.minimize(lambda x: (x-2e-17)**2,2,method='Powell',options={'xtol': 1e-30, 'ftol': 1e-30})
Out[35]:
  status: 0
 success: True
   direc: array([[ 1.]])
    nfev: 20
     fun: array(4.0000000000000006e-34)
       x: array(0.0)
 message: 'Optimization terminated successfully.'
     nit: 2

asked Nov 27 '15 by zell

1 Answer

I understand your technical problem, but in my opinion it stems from inappropriate use of an optimiser. I will indulge myself with some philosophical ramblings before answering the question you asked.


"Typical" optimisation problems "with useful answers" have optimal function values within a few (i.e. substantially fewer than 17) orders of magnitude of 1 that are attained at a point whose coordinates are all within a few orders of magnitude of 1. (Or the optimal value is zero, or some of the optimal coordinates are zero. But users are often still satisfied with very small objective values and coordinates in that case.)

Often, the objective functions fed to black-box optimisers (and their gradients, also fed to some black-box optimisers) aren't written particularly carefully. Near the optimum, the computed gradient of f will be dominated by roundoff error. The gradient may even point away from the optimum. A black-box optimiser is substantially less useful if it loops forever taking steps of length 0, or bombs out with an error, when it gets very close to the optimum; hence parameters with names like "ftol" and "gtol", with fairly liberal default values such as 1e-4.
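
Here's a toy sketch of that effect (my own example, not from the question): write (x-1)^2 in an expanded form that cancels catastrophically near x = 1, then estimate the slope by central differences.

# Toy sketch: (x-1)^2 in expanded form cancels catastrophically near x = 1.
f = lambda x: x*x - 2.0*x + 1.0

x0 = 1.0 + 1e-8                          # very close to the minimiser at x = 1
h = 1e-9
g = (f(x0 + h) - f(x0 - h)) / (2.0*h)    # central-difference slope estimate
print(g)   # true slope is 2e-8; the estimate is mostly rounding noise and can
           # even come out negative, i.e. pointing away from the optimum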

Even in the ideal case, where the user provides one function that, at every x, returns the closest floating-point number to f(x), and another that returns the correctly-rounded gradient of f at x, trying to find the floating-point vector that minimises f is a very ugly discrete optimisation problem. (NP-hard, if memory serves.) If f is a well-scaled quadratic evaluated in a correctly-rounded way---about the nicest nontrivial case I can imagine---the ugly discrete behaviour starts overwhelming the nice continuous behaviour once you're taking steps of length around 1e-8.
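
A quick illustration of that 1e-8 threshold (toy numbers of my own, in double precision): near the minimiser of a well-scaled quadratic, a step of 1e-9 changes f by about 1e-18, which disappears entirely when added to a function value of order 1.

# Well-scaled quadratic with minimum value 2 at x = 1.
f = lambda x: 2.0 + (x - 1.0)**2

print(f(1.0 + 1e-9) == f(1.0))   # True: the ~1e-18 change in f is below one ulp of 2.0
print(f(1.0 + 1e-7) == f(1.0))   # False: a ~1e-14 change in f is representable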

Methods based on a line search will find themselves computing the minimum over all t of f(x + td) for some point x and some direction d. Consider what f(x + td) is in floating-point arithmetic: for each t, you compute x + td somehow, at best getting the floating-point vector closest to x + td, then you plug it into f. In effect, the line search evaluates f along a jagged line through x meandering in the direction d. Even if f is well-behaved and implemented well, the line search can uncover very bad behaviour at a fine scale. Hence parameters with names like xtol that say when to stop a line search.
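
You can see the jaggedness directly (again a toy sketch of mine): at scales near machine epsilon, the computed x + td snaps to the floating-point grid around x instead of moving smoothly with t.

# At fine scales, x + t*d snaps to the floating-point grid around x.
x, d = 1.0, 1.0
for t in [0.0, 1e-17, 1e-16, 2e-16, 3e-16, 4e-16]:
    print(t, (x + t*d) - x)   # the step actually taken; it is 0 or a multiple
                              # of ~2.2e-16, not t itself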

Lots of methods---pretty much everything I can think of besides straight Newton's method---need to take some kind of guess about what scale is reasonable for your problem in order to start up. (BFGS usually takes the identity matrix as an initial guess. I think L-BFGS takes a unit step for its first step. Gradient-descent methods often try a fixed multiple of the gradient first. Trust-region methods use trust regions, which have to start with some radius. If you're doing numerical differentiation, your step length needs to be big enough that you're capturing the "continuous" behaviour rather than the "discrete" behaviour of the function, but small enough that you're capturing its fine behaviour rather than its gross behaviour near your point.)
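
That last trade-off is easy to see directly (my example): differentiate sin numerically at x = 1; too large a step and truncation error dominates, too small and roundoff takes over, with the sweet spot for a forward difference near sqrt(machine epsilon), about 1e-8.

import math

# Forward-difference derivative of sin at 1.0; the exact answer is cos(1.0).
f, x0, exact = math.sin, 1.0, math.cos(1.0)
for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    approx = (f(x0 + h) - f(x0)) / h
    print(h, abs(approx - exact))   # the error dips near h ~ 1e-8, then grows again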

Here, you're optimising a function whose optimal value is zero, attained very close to zero. In theory, nothing I said above about problems being horrible and their subproblems being horrible needs to apply. But do you really expect solvers to have a special case for functions whose optimal value is zero, attained very close to zero? Especially when that's extra code for (likely) reduced robustness? Why not just feed the solver a well-scaled problem instead?
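
For this particular problem the rescaling is trivial. A sketch, using the substitution x = 2e-17 * y so that the minimiser sits at the well-scaled point y = 1:

import scipy.optimize

# Minimise the rescaled problem, whose minimiser is y = 1, then map back.
res = scipy.optimize.minimize(lambda y: (y - 1.0)**2, 2.0, method='Powell')
print(2e-17 * res.x)   # recovers approximately 2e-17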


To answer your direct question, Powell's method in scipy calls Brent's line search, starting with the coordinate directions. Brent's line search, as implemented in scipy, cranks up whatever tolerance you feed it by an additive 1e-11. If you hack scipy.optimize so that Brent's _mintol is 1e-111 instead, I'd bet you get the desired answer. (_mintol is an absolute tolerance in x that's added to the relative tolerance you specify. It's there so that the line search doesn't waste function evaluations deciding whether to step by 1e-200 or by 1e-201 when either case is likely to result in no step at all. So don't actually do that.)
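
For completeness, the hack would look something like the sketch below, though I repeat: don't actually do that. It pokes at a private internal (the Brent class lived in scipy.optimize.optimize at the time; recent releases keep it in scipy.optimize._optimize), so it is fragile across versions.

# Fragile illustration only: monkey-patch scipy's private Brent line search.
from scipy.optimize import optimize as _opt   # scipy.optimize._optimize in newer scipy

class _TightBrent(_opt.Brent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._mintol = 1e-111   # the shipped default is 1e-11

_opt.Brent = _TightBrent   # Powell's line search looks Brent up in module globals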

answered Sep 30 '22 by tmyklebu