R: FAST multivariate optimization packages?

Tags: optimization, r

I am looking to find a local minimum of a scalar function of 4 variables, and I have range constraints on the variables ("box constraints"). There is no closed form for the function's derivative, so methods that need an analytical derivative function are out of the question. I've tried several options and control parameters with the optim function, but all of them seem very slow. Specifically, they seem to spend a lot of time between calls to my (R-defined) objective function, so I know the bottleneck is not my objective function but the "thinking" between calls to it. I looked at the CRAN Task View for Optimization and tried several of those options (DEoptim from RcppDE, etc.), but none of them seem any good. I would have liked to try the nloptr package (an R wrapper for the NLOPT library), but it seems to be unavailable for Windows (a sketch of the kind of call I had in mind is below).

I'm wondering, are there any good, fast optimization packages that people use that I may be missing? Ideally these would be in the form of thin wrappers around good C++/Fortran libraries, so there's minimal pure-R code. (Though this shouldn't be relevant, my optimization problem arose while trying to fit a 4-parameter distribution to a set of values, by minimizing a certain goodness-of-fit measure).

In the past I've found R's optimization libraries to be quite slow, and ended up writing a thin R wrapper calling a C++ API of a commercial optimization library. So are the best libraries necessarily commercial ones?
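For concreteness, the kind of derivative-free, box-constrained call I had in mind with nloptr would look roughly like the sketch below (untested on my end, since I could not install the package on Windows; f stands for an arbitrary R objective function, and the starting point and bounds are placeholders):

library(nloptr)
## local, derivative-free minimization of f over a box (BOBYQA algorithm)
res <- nloptr(x0 = c(4, 1, 0),        # starting point
              eval_f = f,             # objective function; no gradient required
              lb = c(2.1, 0.1, -1),   # lower bounds
              ub = c(15, 2, 1),       # upper bounds
              opts = list(algorithm = "NLOPT_LN_BOBYQA", xtol_rel = 1e-6))
res$solution                          # best parameters found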

UPDATE. Here is a simplified example of the code I'm looking at:

###########
## Given a set of values x and a cdf, calculate a measure of "misfit":
## a smaller value means a better fit.
## x is assumed sorted in non-decreasing order.
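## (The quantity computed below is essentially the Anderson-Darling statistic A^2.)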
Misfit <- function(x, cdf) {
  nevals <<- nevals + 1
  thinkSecs <<- thinkSecs + ( Sys.time() - snapTime)
  cat('S')
  if(nevals %% 20 == 0) cat('\n')
  L <- length(x)
  cdf_x <- pmax(0.0001, pmin(0.9999, cdf(x)))
  measure <- -L - (1/L) * sum( (2 * (1:L)-1 )* ( log( cdf_x ) + log( 1 - rev(cdf_x))))
  snapTime <<- Sys.time()
  cat('E')
  return(measure)  
}
## Given 3 parameters nu (degrees of freedom, or shape), 
## sigma (dispersion), gamma (skewness),
## returns the corresponding 4-parameter student-T cdf parametrized by these params
## (we restrict the location parameter mu to be 0).
skewtGen <- function( p ) {
  require(ghyp)
  pars = student.t( nu = p[1], mu = 0, sigma = p[2], gamma = p[3] )
  function(z) pghyp(z, pars)
}
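
###########
## Quick illustrative sanity check of Misfit() and skewtGen(): the globals that
## Misfit() updates must exist before it is called.
nevals <- 0; thinkSecs <- 0; snapTime <- Sys.time()
cdf <- skewtGen(c(4, 1, 0))          # nu = 4, sigma = 1, gamma = 0 (symmetric t)
Misfit(sort(rt(100, df = 4)), cdf)   # smaller value means a better fit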

## Fit using optim() with the L-BFGS-B method (box-constrained BFGS)
fit_BFGS <- function(x, init = c()) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snap shot
  thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
  tUser <- system.time(
              res <- optim(init, objFun,
                           lower = c(2.1, 0.1, -1), upper = c(15, 2, 1),
                           method = 'L-BFGS-B',
                           control = list(trace=2, factr = 1e12, pgtol = .01 )) )[1]
  cat('Total time = ', tUser, 
      ' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$par)
}
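## Note: factr = 1e12 and pgtol = 0.01 above are much looser stopping tolerances
## than optim's L-BFGS-B defaults (factr = 1e7, pgtol = 0), so the run stops early.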

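## Fit using differential evolution via DEoptim() from the RcppDE package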
fit_DE <- function(x) {
  x <- sort(x)
  nevals <<- 0
  objFun <- function(par) Misfit(x, skewtGen(par))
  snapTime <<- Sys.time() ## global time snap shot
  thinkSecs <<- 0 ## secs spent "thinking" between objFun calls
  require(RcppDE)
  tUser <- system.time(
              res <- DEoptim(objFun,
                             lower = c(2.1, 0.1, -1),
                             upper = c(15, 2, 1) )) [1]
  cat('Total time = ',             tUser,
      ' secs, ObjFun Time Pct = ', 100*(1 - thinkSecs/tUser), '\n')
  cat('results:\n')
  print(res$optim$bestmem)  ## DEoptim returns the best parameters in $optim$bestmem, not $par
}
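
One way to make the differential-evolution run cheaper, assuming RcppDE exposes the same DEoptim.control() interface as the original DEoptim package (NP = population size, itermax = number of generations; the values below are illustrative only), would be to replace the DEoptim() call inside fit_DE with something like:

res <- DEoptim(objFun,
               lower = c(2.1, 0.1, -1),
               upper = c(15, 2, 1),
               control = DEoptim.control(NP = 40, itermax = 50, trace = FALSE))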

Let's generate a random sample:

set.seed(1)
# generate 1000 standard-student-T points with nu = 4 (degrees of freedom)
x <- rt(1000,4)

First, fit using the fit.tuv() function (for "T UniVariate") in the ghyp package; it uses maximum likelihood via the Expectation-Maximization (E-M) algorithm. This is wicked fast!

require(ghyp)
> system.time( print(unlist( pars <- coef( fit.tuv(x, silent = TRUE) ))[c(2,4,5,6)]))
         nu          mu       sigma       gamma 
 3.16658356  0.11008948  1.56794166 -0.04734128 
   user  system elapsed 
   0.27    0.00    0.27 

Now I am trying to fit the distribution a different way: by minimizing the "misfit" measure defined above, using the standard optim() function in base R. Note that the results will not in general be the same. My reason for doing this is to compare these two results for a whole class of situations. I pass in the above maximum-likelihood estimate as the starting point for this optimization.

> fit_BFGS( x, init = c(pars$nu, pars$sigma, pars$gamma) )
N = 3, M = 5 machine precision = 2.22045e-16
....................
....................
.........
iterations 5
function evaluations 7
segments explored during Cauchy searches 7
BFGS updates skipped 0
active bounds at final generalized Cauchy point 0
norm of the final projected gradient 0.0492174
final function value 0.368136

final  value 0.368136 
converged
Total time =  41.02  secs, ObjFun Time Pct =  99.77084 
results:
[1] 3.2389296 1.5483393 0.1161706

I also tried to fit with DEoptim(), but it ran for too long and I had to kill it. As you can see from the output above, 99.8% of the time is attributable to the objective function! So Dirk and Mike were right in their comments below. I should have more carefully estimated the time spent in my objective function, and printing dots was not a good idea! Also, I suspect the MLE (E-M) method is very fast because it uses an analytical (closed-form) expression for the log-likelihood function.
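For example, a direct timing of the objective function alone (a rough check I could have done up front: ten evaluations at the maximum-likelihood estimate, reusing the definitions above) makes the bottleneck obvious:

xs <- sort(x)
nevals <- 0; thinkSecs <- 0; snapTime <- Sys.time()
objFun <- function(par) Misfit(xs, skewtGen(par))
system.time( for (i in 1:10) objFun(c(pars$nu, pars$sigma, pars$gamma)) )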

Asked Feb 16 '11 by Prasad Chalasani


1 Answer

A maximum likelihood estimator, when it exists for your problem, will always be faster than a global optimizer, in any language.

A global optimizer, no matter the algorithm, typically combines some random jumps with local minimization routines. Different algorithms may be described in terms of populations (genetic algorithms), annealing, migration, etc., but they are all conceptually similar.

In practice, this means that if you have a smooth function, some other optimization algorithm will likely be fastest. The characteristics of your objective function will dictate whether yours is a quadratic, linear, conic, or some other type of optimization problem for which an exact (or near-exact) analytical solution exists, or whether you will need to apply a global optimizer that is necessarily slower.
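As a tiny illustration of the gap in function-evaluation counts, take a toy smooth objective (a generic example, not the asker's problem):

f <- function(p) sum((p - c(1, 2, 3))^2)   # smooth toy objective, minimum at (1, 2, 3)
optim(rep(0, 3), f, method = "L-BFGS-B",
      lower = rep(-5, 3), upper = rep(5, 3))$counts
## a local gradient-based method needs only a handful of evaluations here, while a
## population-based global search (e.g. differential evolution) would typically
## spend thousands of evaluations of f to reach comparable accuracy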

By using ghyp, you're saying that your 4-variable function produces an output that may be fit to the generalized hyperbolic distribution, and you are using a maximum likelihood estimator to find the generalized hyperbolic distribution closest to the data you've provided. But if you are doing that, I'm afraid I don't understand how you could have a non-smooth surface requiring optimization.

In general, the optimizer you choose needs to be matched to your problem. There is no perfect 'optimal optimizer' in any programming language, and the choice of optimization algorithm to fit your problem will likely make more of a difference than any minor inefficiencies of the implementation.

Answered by Brian G. Peterson