I am using a CNN for a regression task. I use Tensorflow and the optimizer is Adam. The network seems to converge perfectly fine till one point where the loss suddenly increases along with the validation error. Here are the loss plots of the labels and the weights separated (Optimizer is run on the sum of them)
I use l2 loss for weight regularization and also for the labels. I apply some randomness on the training data. I am currently trying RSMProp to see if the behavior changes but it takes at least 8h to reproduce the error.
I would like to understand how this can happen. Hope you can help me.
This increase in loss value is due to Adam, the moment the local minimum is exceeded and a certain number of iterations, a small number is divided by an even smaller number and the loss value explodes.
Adam optimizer is an adoptive learning rate optimizer that is very popular for deep learning, especially in computer vision. I have seen some papers that after specific epochs, for example, 50 epochs, they decrease its learning rate by dividing it by 10.
If the model output is constant, the training loss would fluctuate( because of different samples in different batches) whereas the validation loss would remain constant. Maybe try altering the learning rate/batch size or add more data(to validation set as well).
Yes, absolutely. From my own experience, it's very useful to Adam with learning rate decay. Without decay, you have to set a very small learning rate so the loss won't begin to diverge after decrease to a point. Here, I post the code to use Adam with learning rate decay using TensorFlow.
My experience over the last months is the following: Adam is very easy to use because you don't have to play with initial learning rate very much and it almost always works. However, when coming to convergence Adam does not really sattle with a solution but jiggles around at higher iterations. While SGD gives an almost perfectly shaped loss plot and seems to converge much better in higher iterations. But changing litte parts of the setup requires to adjust the SGD parameters or you will end up with NaNs... For experiments on architectures and general approaches I favor Adam, but if you want to get the best version of one chosen architecture you should use SGD and at least compare the solutions.
I also noticed that a good initial SGD setup (learning rate, weight decay etc.) converges as fast as using Adam, at leas for my setup. Hope this may help some of you!
EDIT: Please note that the effects in my initial question are NOT normal even with Adam. Seems like I had a bug but I can't really remember the issue there.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With