
TensorFlow - GradientDescentOptimizer - are we actually finding global optimum?

I have been playing with TensorFlow for quite a while, and I have more of a theoretical question. In general, when we train a network we usually use GradientDescentOptimizer (or one of its variants such as Adagrad or Adam) to minimize the loss function. It looks like we are trying to adjust the weights and biases so that we reach the global minimum of this loss function. But the issue is that I assume this function has an extremely complicated shape if you plot it, with lots of local optima. What I wonder is: how can we be sure that gradient descent finds the global optimum, and that we don't instantly get stuck in some local optimum that is far away from the global one?

I recall that, for example, when you perform clustering in sklearn it usually runs the clustering algorithm several times with random initialization of the cluster centers, and by doing this we make sure we don't get stuck with a suboptimal result. But we don't do anything like this when training ANNs in TensorFlow - we start with some random weights and just travel along the slope of the function.
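(To make the comparison concrete, here is a minimal sketch of that "multiple random restarts" idea. The KMeans call with n_init is real sklearn behaviour; train_model is a hypothetical placeholder for whatever training loop you use, just to show that the same restart trick could in principle be applied to a network by hand.)

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 5)

# sklearn already does restarts for you: run k-means 10 times from
# different random centers and keep the run with the lowest inertia.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Nothing stops you from doing the same for a network by hand.
# train_model is a hypothetical placeholder that builds a model from a
# random seed, trains it, and returns (final_loss, trained_model).
def restart_training(train_model, n_restarts=5):
    best_loss, best_model = float("inf"), None
    for seed in range(n_restarts):
        loss, model = train_model(seed)
        if loss < best_loss:
            best_loss, best_model = loss, model
    return best_model
```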

So, any insight into this? Why can we be more or less sure that the result of training via gradient descent is close to the global minimum once the loss stops decreasing significantly?

Just to clarify why I am wondering about this: if we can't be sure that we get at least close to the global minimum, we can't easily judge which of two different models is actually better. We could run an experiment and get an evaluation which suggests that a model is not good, when in fact it just got stuck in a local minimum shortly after training started, while another model which seemed better to us was simply luckier to start from a better initial point and didn't get stuck in a local minimum early on. Moreover, this means we can't even be sure that we are getting the most out of the network architecture we are currently testing. For example, it could have a really good global minimum that is hard to find, so we mostly end up with poor solutions in local minima far away from the global optimum and never see the full potential of the network at hand.

asked Jul 12 '16 by Maksim Khaitovich


2 Answers

Gradient descent, by its nature, looks at the function locally (it only sees the local gradient). Hence, there is absolutely no guarantee that it will find the global minimum. In fact, it probably will not, unless the function is convex. This is also the reason GD-like methods are sensitive to the initial position you start from. Having said that, there was a recent paper arguing that in high-dimensional solution spaces the number of local maxima/minima is not as large as previously thought.
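As a small self-contained illustration of that sensitivity to the starting point (a toy sketch of my own, not from any paper): plain gradient descent on a simple non-convex one-dimensional function ends up in a different minimum depending only on where it starts.

```python
# Gradient descent on f(x) = x**4 - 3*x**2 + x, which has a global
# minimum near x ≈ -1.30 and a shallower local minimum near x ≈ 1.13.
def grad(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(-2.0))  # ends near -1.30 (the global minimum)
print(gradient_descent(+2.0))  # ends near +1.13 (only a local minimum)
```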

Finding global minima in a high-dimensional space in a reasonable amount of time seems very much an unsolved problem. However, you might want to focus more on saddle points rather than on local minima. See this post for example:

High level description for saddle point problem

A more detailed paper is here: https://arxiv.org/pdf/1406.2572.pdf
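To make the saddle-point issue concrete (a toy sketch of my own, not taken from the linked paper): on f(x, y) = x² − y² the origin is a saddle point. If the iterate sits exactly on the x-axis the gradient in y is zero, so plain gradient descent crawls straight into the saddle and stalls, while any tiny perturbation in y grows and lets it escape.

```python
# Saddle of f(x, y) = x**2 - y**2 at the origin.
# Gradient is (2x, -2y); gradient descent subtracts lr * gradient.
def gd_2d(x, y, lr=0.1, steps=100):
    for _ in range(steps):
        x, y = x - lr * 2 * x, y + lr * 2 * y
    return x, y

print(gd_2d(1.0, 0.0))    # y stays exactly 0: the iterate stalls at the saddle (0, 0)
print(gd_2d(1.0, 1e-6))   # a tiny perturbation in y keeps growing, away from the saddle
```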

answered Sep 18 '22 by Luca


Your intuition is quite right. Complex models such as neural networks are typically applied to problems with high-dimensional input, where the error surface has a very complex landscape.

Neural networks are not guaranteed to find the global optimum, and getting stuck in local minima is a problem on which a lot of research has been focused. If you're interested in finding out more about this, it is worth looking at techniques such as online learning and momentum, which have traditionally been used to avoid the problem of local minima. However, these techniques bring difficulties of their own, e.g. online learning cannot be integrated with some optimisation techniques, and the addition of a momentum hyper-parameter to the backpropagation algorithm brings further difficulties in training.

A really good video for visualising the influence of momentum (and how it overcomes local minima) during backpropagation can be found here.
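For reference, here is a minimal sketch of the momentum update itself (my own illustration on a toy one-dimensional function with one shallow and one deep minimum, not taken from the video): the velocity term accumulates past gradients, which can carry the iterate over a barrier that plain gradient descent would settle behind.

```python
# Toy function f(x) = x**4 - 3*x**2 + x.
# Plain gradient descent started at x = 2 settles in the shallow local
# minimum near x ≈ 1.13; the momentum term carries the iterate over the
# barrier and into the deeper minimum near x ≈ -1.30.
def grad(x):
    return 4 * x**3 - 6 * x + 1

def momentum_descent(x, lr=0.01, mu=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(x)   # velocity accumulates past gradients
        x = x + v
    return x

print(momentum_descent(2.0))  # ends near -1.30, unlike plain GD from 2.0
```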

Added after question edit - see comments

It’s the aforementioned nature of the problems neural networks are applied to that means we often can’t find a globally optimal solution: in the general case, traversing the entire search space for the optimal solution would be intractable using classical computing technology (quantum computers could change this for some problems). As such, neural networks are trained to achieve what is hopefully a ‘good’ local optimum.

If you're interested in reading more detailed information about the techniques employed to find good local optima (i.e. something that approximates a global solution), a good paper to read might be this one.

answered Sep 19 '22 by Mark