 

What are alternatives of Gradient Descent?

Gradient descent can get stuck in a local minimum, so we may have to run it many times from different starting points to find the global minimum.
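For example, here is roughly what I mean, as a minimal Python sketch with a toy 1-D objective (the function, learning rate, and number of restarts are just made up for illustration):

```python
import numpy as np

def f(x):
    # toy objective with several local minima
    return np.sin(3 * x) + 0.1 * x ** 2

def grad_f(x):
    # derivative of the toy objective
    return 3 * np.cos(3 * x) + 0.2 * x

def gradient_descent(x0, lr=0.01, steps=1000):
    # plain gradient descent from a single starting point
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# random restarts: run gradient descent from many starting points
# and keep the best result -- this is the "run it many times" part
starts = np.random.uniform(-5, 5, size=20)
best_x = min((gradient_descent(x0) for x0 in starts), key=f)
print(best_x, f(best_x))
```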

Can anybody tell me about alternatives to gradient descent, with their pros and cons?

Thanks.

asked May 08 '14 by Nusrat


2 Answers

It has been demonstrated that getting stuck in a local minimum is very unlikely in a high-dimensional space, because a point where all the derivatives are zero is far more likely to be a saddle point than a minimum: the loss would have to curve upwards in every single dimension at once. (Source: Andrew Ng, Coursera Deep Learning Specialization.) That also explains why gradient descent works so well in practice.
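A rough numerical illustration of that intuition (just a sketch that treats the Hessian at a random critical point as a random symmetric matrix, which is a crude stand-in, not a statement about real loss surfaces):

```python
import numpy as np

# At a critical point all derivatives are zero; whether it is a minimum
# or a saddle depends on the Hessian. Model the Hessian as a random
# symmetric matrix and check how often ALL eigenvalues are positive,
# i.e. how often the critical point is a true local minimum.
rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=2000):
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2           # symmetric "Hessian"
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for dim in (1, 2, 5, 10):
    print(dim, fraction_of_minima(dim))
# the fraction drops towards zero quickly as the dimension grows:
# in high dimensions almost every critical point is a saddle point
```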

answered Sep 23 '22 by Sébastien Mougel


See my master's thesis for a very similar list:

Optimization algorithms for neural networks

  • Gradient based
    • Flavours of gradient descent (only first-order gradients; a sketch of a few of the update rules follows after this list):
      • Stochastic gradient descent
      • Mini-Batch gradient descent:
      • Learning Rate Scheduling:
        • Momentum:
        • RProp and the mini-batch version RMSProp
        • AdaGrad
        • Adadelta (paper)
        • Exponential Decay Learning Rate
        • Performance Scheduling
        • Newbob Scheduling
      • Quickprop
      • Nesterov Accelerated Gradient (NAG)
    • Higher order gradients
      • Newton's method: Typically not possible
      • Quasi-Newton method
        • BFGS
        • L-BFGS
    • Unsure how it works
      • Adam (Adaptive Moment Estimation)
        • AdaMax
      • Conjugate gradient
  • Alternatives
    • Genetic algorithms
    • Simulated Annealing (a toy annealing loop is sketched after this list)
    • Twiddle
    • Markov random fields (graphcut/mincut)
    • The Simplex algorithm is used for linear optimization in an operations research setting, but apparently also for neural networks (source)
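To make the differences between the first-order flavours concrete, here is a minimal NumPy sketch of three of the update rules (plain SGD, momentum, and Adam); the hyperparameter defaults are just common illustrative values:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # plain (stochastic) gradient descent: step against the gradient
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    # momentum: keep a running velocity so consistent directions accelerate
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: per-parameter step sizes from running estimates of the
    # first moment (m) and second moment (v) of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```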
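And as an example of the gradient-free alternatives, a toy simulated-annealing loop on a 1-D objective (the objective, proposal noise, and cooling schedule are arbitrary choices for illustration):

```python
import numpy as np

def f(x):
    # toy objective with several local minima
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5)                     # random starting point
best_x = x
temperature = 1.0

for step in range(5000):
    candidate = x + rng.normal(scale=0.5)  # random proposal, no gradient used
    delta = f(candidate) - f(x)
    # always accept improvements; accept worse moves with a probability
    # that shrinks as the temperature cools, which helps escape local minima
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        x = candidate
        if f(x) < f(best_x):
            best_x = x
    temperature *= 0.999                   # cooling schedule

print(best_x, f(best_x))
```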

You might also want to have a look at my article about optimization basics and at Alec Radford's nice GIFs (1 and 2).

Other interesting resources are:

  • An overview of gradient descent optimization algorithms

Trade-Offs

I think each of the posted optimization algorithms has scenarios where it is advantageous. The general trade-offs are:

  • How much of an improvement do you get in one step?
  • How fast can you calculate one step?
  • How much data can the algorithm deal with?
  • Is it guaranteed to find a local minimum?
  • What requirements does the optimization algorithm have for your function? (e.g. to be once, twice or three times differentiable)
answered Sep 25 '22 by Martin Thoma