I have a convolutional neural network whose architecture I have modified. I do not have time to retrain and run a cross-validated grid search over the optimal hyperparameters, so I want to adjust the learning rate intuitively.
Should I increase or decrease the learning rate of my RMSProp (SGD-based) optimiser if I add more layers/neurons, or if I remove the subsampling (pooling) layers?
Well, adding more layers/neurons increases the chance of over-fitting. Therefore it would be better to decrease the learning rate over time. Removing the subsampling layers also increases the number of parameters and, again, the chance of over-fitting. Empirical results, at least, strongly suggest that subsampling layers help the model learn significantly better, so avoid removing them.
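"Decreasing the learning rate over time" usually means attaching a decay schedule to the optimiser. A minimal PyTorch sketch (the model, initial rate, and decay factors below are illustrative assumptions, not tuned values):

import torch

# Illustrative only: a tiny model and an RMSprop optimiser with step decay.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one epoch of training (forward pass, loss, backward, optimizer.step()) ...
    scheduler.step()  # halves the learning rate every 10 epochs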
I also suggest you generate more examples by cropping the images and training the model on those cropped versions as well. This acts as a regularizer and helps the model learn a better distribution of the data. Then you can also increase the number of layers/neurons with less risk of over-fitting.
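As an illustration of the cropping idea, a minimal sketch using torchvision's transforms (the 32x32 crop size and padding are assumptions, e.g. for CIFAR-sized images):

from torchvision import transforms

# Random crops (with padding) and flips expose the network to shifted views
# of each image, which acts as a regularizer during training.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])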
We all agree that the learning rate can be seen as a way to control overfitting, just like dropout or batch size. But I'm writing this answer because I think the following in Amir's answer and comments is misleading:
adding more layers/neurons increases the chance of over-fitting. Therefore it would be better to decrease the learning rate over time.
Since adding more layers/nodes to the model makes it prone to over-fitting [...] taking small steps towards the local minima is recommended
It's actually the OPPOSITE! A smaller learning rate will increase the risk of overfitting!
Citing from Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Smith & Topin, 2018), which is a very interesting read by the way:
There are many forms of regularization, such as large learning rates, small batch sizes, weight decay, and dropout. Practitioners must balance the various forms of regularization for each dataset and architecture in order to obtain good performance. Reducing other forms of regularization and regularizing with very large learning rates makes training significantly more efficient.
So, as Guillaume Chevalier said in his first comment, if you add regularization, decreasing the learning rate might be a good idea if you want to keep the overall amount of regularization constant. But if your goal is to increase the overall amount of regularization, or if you reduced other means of regularization (e.g., decreased dropout, increased batch size), then the learning rate should be increased.
Related (and also very interesting): Don't decay the learning rate, increase the batch size (Smith et al. ICLR'18).
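To make the batch-size part of this trade-off concrete, here is a rough sketch of the linear-scaling heuristic (the function name and numbers are hypothetical, and the rule is a heuristic, not a guarantee): a larger batch size reduces the gradient noise and hence the regularization it provides, so the learning rate can be raised proportionally to compensate.

def rescale_lr(base_lr, base_batch_size, new_batch_size):
    # Linear-scaling rule of thumb: keep the "SGD noise" level roughly constant
    # by scaling the learning rate with the batch size.
    return base_lr * new_batch_size / base_batch_size

print(rescale_lr(0.001, 128, 256))  # doubling the batch size -> 0.002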
As a short and practical answer: in the learning-rate schedule below, the rate is decreased when the model is more complex. The variable model_size is roughly the number of neurons per layer (the model's hidden dimension):
def rate(self, step=None):
    """Noam schedule (a method of the NoamOpt wrapper in The Annotated Transformer):
    lr = factor * model_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))."""
    if step is None:
        step = self._step
    return self.factor * (
        self.model_size ** (-0.5)
        * min(step ** (-0.5), step * self.warmup ** (-1.5))
    )
Source: The Annotated Transformer
Also see: Adam: A Method for Stochastic Optimization
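For illustration, a standalone sketch of the same schedule (the factor, warmup, and step values below are arbitrary), showing that a larger model_size yields a smaller learning rate:

def noam_rate(step, model_size, factor=1.0, warmup=4000):
    # lr = factor * model_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
    return factor * (model_size ** (-0.5)
                     * min(step ** (-0.5), step * warmup ** (-1.5)))

print(noam_rate(step=4000, model_size=256))   # ~0.00099
print(noam_rate(step=4000, model_size=1024))  # ~0.00049: larger model, smaller rate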