 

Increase or decrease learning rate for adding neurons or weights?

I have a convolutional neural network whose architecture I have modified. I do not have time to retrain it and perform a cross-validation (grid search over the optimal hyperparameters), so I want to adjust the learning rate intuitively.

Should I increase or decrease the learning rate of my RMSProp (SGD-based) optimiser if:

  1. I add more neurons to the fully connected layers?
  2. on a convolutional neural network, I remove a sub-sampling (average or max pooling) layer before the fully connected layers, and I increase the number of fully connected units between that feature map and the softmax outputs, so that more weights are connected to the fully connected neurons on top (sketched below)?
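
For concreteness, here is a minimal sketch of the change described in point 2, using hypothetical PyTorch layer sizes (the real architecture is not shown here):

import torch.nn as nn

# Original head: conv -> average pooling (sub-sampling) -> fully connected -> softmax outputs
original_head = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.AvgPool2d(2),                 # sub-sampling layer
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256),      # fully connected layer
    nn.Linear(256, 10),              # softmax outputs (via the loss)
)

# Modified head: pooling removed and more fully connected units,
# so far more weights feed into the fully connected layer on top
modified_head = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.Flatten(),                    # no sub-sampling: the feature map stays 16x16
    nn.Linear(64 * 16 * 16, 512),    # wider fully connected layer
    nn.Linear(512, 10),
)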
asked Dec 27 '15 by Guillaume Chevalier

3 Answers

Adding more layers/neurons increases the chance of over-fitting. Therefore it would be better if you decrease the learning rate over time. Removing the subsampling layers also increases the number of parameters and, again, the chance of over-fitting. Empirical results strongly suggest that subsampling layers help the model learn significantly better, so avoid removing them.
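
For example, a minimal sketch of decreasing the learning rate over time with PyTorch's RMSprop and a step scheduler (the model, base rate, and schedule below are placeholders, not values from the answer):

from torch import nn, optim

model = nn.Linear(128, 10)  # stand-in for the actual network
optimizer = optim.RMSprop(model.parameters(), lr=1e-3)  # placeholder base rate

# Halve the learning rate every 10 epochs (arbitrary schedule)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward(), optimizer.step() per batch ...
    scheduler.step()  # decay the learning rate over time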

I also suggest you generate more examples by cropping the images and train the model with those cropped versions too. This works as a regularizer and helps the model learn a better distribution of the data. Then you can also increase the number of layers/neurons with less risk of over-fitting.
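
A minimal sketch of that cropping augmentation, assuming torchvision is used (the crop size and padding are placeholders):

from torchvision import transforms

# Random crops (after padding) and flips act as a regularizer: each epoch the
# model sees slightly different versions of every training image.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # crop size / padding are placeholders
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])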

answered Oct 06 '22 by Amir

We all agree that the learning rate can be seen as a way to control overfitting, just like dropout or batch size. But I'm writing this answer because I think the following statements from Amir's answer and comments are misleading:

  • adding more layers/neurons increases the chance of over-fitting. Therefore it would be better if you decrease the learning rate over time.

  • Since adding more layers/nodes to the model makes it prone to over-fitting [...] taking small steps towards the local minima is recommended

It's actually the OPPOSITE! A smaller learning rate will increase the risk of overfitting!

Citing from Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates (Smith & Topin 2018) (a very interesting read btw):

There are many forms of regularization, such as large learning rates, small batch sizes, weight decay, and dropout. Practitioners must balance the various forms of regularization for each dataset and architecture in order to obtain good performance. Reducing other forms of regularization and regularizing with very large learning rates makes training significantly more efficient.

So, as Guillaume Chevalier said in his first comment, if you add regularization, decreasing the learning rate might be a good idea if you want to keep the overall amount of regularization constant. But if your goal is to increase the overall amount of regularization, or if you reduced other means of regularization (e.g., decreased dropout, increased batch size), then the learning rate should be increased.
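
As a rough sketch of that trade-off, with purely illustrative models, rates, and batch sizes: reducing dropout and enlarging the batch size removes regularization, so the learning rate is raised to compensate.

from torch import nn, optim

# Before: strong dropout, small batches -> a smaller learning rate suffices
model_a = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, 10))
opt_a = optim.RMSprop(model_a.parameters(), lr=1e-3)   # e.g. with batch_size = 32

# After: dropout reduced and batch size increased -> raise the learning rate
# to keep the overall amount of regularization roughly constant
model_b = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, 10))
opt_b = optim.RMSprop(model_b.parameters(), lr=3e-3)   # e.g. with batch_size = 256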

Related (and also very interesting): Don't decay the learning rate, increase the batch size (Smith et al. ICLR'18).

answered Oct 05 '22 by Antoine


As a short and practical answer: in the code below, the learning rate is decreased when the model is more complex; the variable model_size is approximately the number of neurons per layer:

def rate(self, step=None):
    "Implement `lrate` above"
    # lr = factor * model_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5)):
    # the larger the model, the smaller the learning rate at every step.
    if step is None:
        step = self._step
    return self.factor * \
        (self.model_size ** (-0.5) *
         min(step ** (-0.5), step * self.warmup ** (-1.5)))

Source: The Annotated Transformer
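
To see the effect of model_size in isolation, here is a hypothetical standalone version of the same formula (noam_rate is not from the notebook); with everything else fixed, a larger model_size gives a smaller learning rate at every step:

def noam_rate(step, model_size, factor=1.0, warmup=4000):
    # lr = factor * model_size^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
    return factor * model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))

print(noam_rate(step=4000, model_size=256))    # ~0.00099
print(noam_rate(step=4000, model_size=1024))   # ~0.00049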

Also see: Adam: A Method for Stochastic Optimization

answered Oct 05 '22 by Guillaume Chevalier