I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.
I've read that batch updates of the weights give a more stable gradient search than online weight updates.
As a test I've created a time series function of 100 points, such that x = [0..99] and y = f(x). I've created a Neural Network with one input and one output and 2 hidden layers with 10 neurons for testing. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.
I have 100 input points, so when I calculate the weight change dw_{ij} for each node it is actually a sum:
dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}
where p = 100 in this case.
Now the weight updates become really huge and therefore my error E bounces around such that it is hard to find a minimum. The only way I got some proper behaviour was when I set the learning rate y to something like 0.7 / p^2.
Is there some general rule for setting the learning rate, based on the number of samples?
“Increasing batch size” replaces learning rate decay by batch size increases. “Increased initial learning rate” additionally increases the initial learning rate from 0.1 to 0.5. Finally “Increased momentum coefficient” also increases the momentum coefficient from 0.9 to 0.98.
For those unaware, the general rule is “bigger batch size, bigger learning rate”. This is logical because a bigger batch size means more confidence in the direction of your “descent” of the error surface, while the smaller the batch size is, the closer you are to “stochastic” descent (batch size 1).
The model weights will be updated after each batch of five samples. This also means that one epoch will involve 40 batches or 40 updates to the model. With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.
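The arithmetic in that snippet can be checked with a few lines of Java. This is only a sketch: the dataset size of 200 samples is not stated in the text and is inferred here from 40 batches of 5 samples, and the class and method names are made up for illustration.

```java
final class BatchCount {
    // Number of weight updates in one pass over the data (ceiling division,
    // so a final partial batch still counts as one update).
    static long batchesPerEpoch(long samples, long batchSize) {
        return (samples + batchSize - 1) / batchSize;
    }

    // Total number of weight updates over the whole training run.
    static long totalUpdates(long samples, long batchSize, long epochs) {
        return batchesPerEpoch(samples, batchSize) * epochs;
    }

    public static void main(String[] args) {
        long samples = 200; // assumption: inferred from 40 batches x 5 samples
        System.out.println(batchesPerEpoch(samples, 5));    // prints 40
        System.out.println(totalUpdates(samples, 5, 1000)); // prints 40000
    }
}
```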
When learning gradient descent, we learn that learning rate and batch size matter. Specifically, increasing the learning rate speeds up the learning of your model, yet risks overshooting its minimum loss. Reducing batch size means your model uses fewer samples to calculate the loss in each iteration of learning.
http://francky.me/faqai.php#otherFAQs :
Subject: What learning rate should be used for backprop?
In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and objective function diverge, so there is no learning at all. If the objective function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs with hidden units, the optimal learning rate often changes dramatically during the training process, since the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a tedious process requiring much trial and error. For some examples of how the choice of learning rate and momentum interact with numerical condition in some very simple networks, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html
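As a worked illustration of the quadratic case mentioned above, here is the standard textbook calculation. It is a sketch under stated assumptions: the objective is exactly quadratic with a symmetric positive-definite Hessian H, and the symbols H, b, \eta and \lambda_i are introduced here for illustration and do not appear in the FAQ text.

```latex
\begin{align*}
E(w) &= \tfrac{1}{2}\, w^\top H w - b^\top w,
  \qquad \nabla E(w) = H w - b,
  \qquad w^\ast = H^{-1} b \\
w_{t+1} &= w_t - \eta\, \nabla E(w_t)
  \;\Longrightarrow\;
  w_{t+1} - w^\ast = (I - \eta H)\,(w_t - w^\ast) \\
\text{convergence from every start}
  &\iff |1 - \eta \lambda_i| < 1 \ \text{for every eigenvalue } \lambda_i \text{ of } H
  \iff 0 < \eta < \frac{2}{\lambda_{\max}(H)}
\end{align*}
```

Under the same assumptions, summing the error over p samples instead of averaging it multiplies H by p, so the largest stable learning rate shrinks by a factor of p.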
With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist (see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned under "What are conjugate gradients, Levenberg-Marquardt, etc.?").
Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
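To illustrate what "no dependence on the magnitude of the gradient" means, here is a minimal one-weight sketch in the spirit of RPROP (the variant without weight backtracking). It is not taken from the FAQ: the class name, the toy objective with its minimum at w = 3, and the constants (1.2, 0.5, initial step 0.1, step bounds 1e-6 and 50) are assumptions chosen for illustration, the constants being commonly quoted defaults. Only the sign of the gradient is used, so scaling the gradient by a positive factor leaves the path unchanged.

```java
import java.util.function.DoubleUnaryOperator;

final class RpropSketch {
    // Minimizes a one-dimensional function given its gradient, using only
    // the SIGN of the gradient and a per-weight step size that adapts.
    static double minimize(DoubleUnaryOperator gradient, double w, int iterations) {
        final double etaPlus = 1.2;   // grow the step while the sign is stable
        final double etaMinus = 0.5;  // shrink the step after a sign change
        final double stepMin = 1e-6;
        final double stepMax = 50.0;
        double step = 0.1;            // initial step size (assumed default)
        double previousGradient = 0.0;

        for (int i = 0; i < iterations; i++) {
            double g = gradient.applyAsDouble(w);
            double signProduct = Math.signum(g) * Math.signum(previousGradient);
            if (signProduct > 0) {
                step = Math.min(step * etaPlus, stepMax);
            } else if (signProduct < 0) {
                // The minimum was overshot: shrink the step and skip this update.
                step = Math.max(step * etaMinus, stepMin);
                g = 0.0;
            }
            w -= Math.signum(g) * step; // magnitude of g is never used
            previousGradient = g;
        }
        return w;
    }

    public static void main(String[] args) {
        // Same minimum at w = 3, gradients differing by a factor of 1e8.
        double flat = minimize(w -> 1e-4 * (w - 3.0), 0.0, 200);
        double steep = minimize(w -> 1e4 * (w - 3.0), 0.0, 200);
        System.out.println("flat objective:  w = " + flat);
        System.out.println("steep objective: w = " + steep);
    }
}
```

Plain gradient descent with a single fixed learning rate would need very different rates for these two objectives.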
With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the learning rate during training. Various proposals have appeared in the NN literature, but most of them don't work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate. There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running average of the weight values. I have no personal experience with these methods; if you have any solid evidence that these or other methods of automatically setting the learning rate and/or momentum in incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer ([email protected]).
A simple solution would be to average the weight changes over the batch instead of summing them, i.e. divide the summed dw_{ij} by p. This way you can just use a learning rate of 0.7 (or any other value of your liking), without having to worry about optimizing yet another parameter.
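A minimal Java sketch of the difference between summing and averaging is given below. It does not use the asker's network: as an assumption for illustration it fits a single weight w in the model y = w * x with squared error, on 100 inputs scaled to [0, 1] and targets y = 2x, and all class and method names are made up. On this toy problem, a learning rate of 0.7 is stable with the averaged gradient and diverges with the summed one, and the summed gradient with learning rate 0.7 / p follows the same path as the averaged gradient with 0.7.

```java
final class BatchGradientDemo {
    // Gradient of the per-sample error 0.5 * (w*x - y)^2 with respect to w.
    static double sampleGradient(double w, double x, double y) {
        return (w * x - y) * x;
    }

    // Batch gradient: either the plain sum over all samples or their mean.
    static double batchGradient(double w, double[] xs, double[] ys, boolean average) {
        double sum = 0.0;
        for (int i = 0; i < xs.length; i++) {
            sum += sampleGradient(w, xs[i], ys[i]);
        }
        return average ? sum / xs.length : sum;
    }

    // Full-batch gradient descent starting from w = 0.
    static double train(double[] xs, double[] ys, double learningRate,
                        boolean average, int steps) {
        double w = 0.0;
        for (int s = 0; s < steps; s++) {
            w -= learningRate * batchGradient(w, xs, ys, average);
        }
        return w;
    }

    // Toy data (assumption): p inputs in [0, 1], targets y = 2 * x.
    static double[][] makeData(int p) {
        double[] xs = new double[p];
        double[] ys = new double[p];
        for (int i = 0; i < p; i++) {
            xs[i] = i / (double) (p - 1);
            ys[i] = 2.0 * xs[i];
        }
        return new double[][] {xs, ys};
    }

    public static void main(String[] args) {
        int p = 100;
        double[][] d = makeData(p);
        System.out.println("mean, lr 0.7:   " + train(d[0], d[1], 0.7, true, 100));
        System.out.println("sum,  lr 0.7:   " + train(d[0], d[1], 0.7, false, 100));
        System.out.println("sum,  lr 0.7/p: " + train(d[0], d[1], 0.7 / p, false, 100));
    }
}
```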
More interesting information about batch updating and learning rates can be found in this article by Wilson (2003).