I'm new in the field of machine learning and recently I heard about this term. I tried to read some articles in the internet but I still don't understand the idea behind it. Can someone give me some examples?
During back-propagation, we're adjusting the weights of the model to adapt to the most recent training results. On a nicely-behaved surface, we would simply use Newton's method and converge to the optimum solution with no problem. However, reality is rarely well-behaved, especially in the initial chaos of a randomly-initialized model. We need to traverse the space with something less haphazard than a full-scale attempt to hit the optimum on the next iteration (as Newton's method does).
Instead, we make two amendments to Newton's approach.  The first is the learning rate: Newton adjusted weights by using the local gradient to compute where the solution should be, and going straight to that new input value for the next iteration.  Learning rate scales this down quite a bit, taking smaller steps in the indicated direction.  For instance, a learning rate of 0.1 says to go only 10% of the computed distance.  From that new value, we again compute the gradient, "sneaking up" on the solution.  This gives us a better chance of finding the optimum on a varied surface, rather than overshooting or oscillating past it in all directions.
Momentum see here is a similar attempt to maintain a consistent direction.  If we're taking smaller steps, it also makes sense to maintain a somewhat consistent heading through our space.  We take a linear combination of the previous heading vector, and the newly-computed gradient vector, and adjust in that direction.  For instance, if we have a momentum of 0.90, we will take 90% of the previous direction plus 10% of the new direction, and adjust weights accordingly -- multiplying that direction vector by the learning rate.
Does that help?
Momentum is a term used in gradient descent algorithm.
Gradient descent is an optimization algorithm which works by finding the direction of steepest slope in its current status and updates its status by moving towards that direction. As a result, in each step its guaranteed that the value of function to be minimized decreases by each step. The problem is this direction can change greatly in some points of the function while there best path to go usually does not contain a lot of turns. So it's desirable for us to make the algorithm keep the direction it has already been going along for sometime before it changes its direction. To do that the momentum is introduced.
A way of thinking about this is imagining a stone rolling down a hill till it stops at a flat area (a local minimum). If the stone rolling down a hill happens to pass a point where the steepest direction changes for a just moment, we don't expect it to change its direction entirely (since its physical momentum keeps it going). But if the direction of slope changes entirely the stone will gradually change its direction towards the steepest descent again.
here is a detailed link, you might want to check out the math behind it or just see the effect of momentum in action:
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With