I'm fine-tuning a Longformer on a binary document text classification task using the Hugging Face Trainer class, and I'm monitoring the metrics of some checkpoints with TensorBoard.
Even though the F1 score and accuracy are quite high, I'm puzzled by the fluctuations of the training loss.
I read online that a reason for this can be:
Here I have reported the trends of F1, accuracy, loss, and smoothed loss. The grey line uses a learning rate of 1e-6, while the pink one uses 1e-5.
I summarize all the info about my training:
What could be the reason? Can this be considered a problem despite the quite good F1 and accuracy results?
I will first tell you the reason for the fluctuations and then a possible way to solve it.
REASON
When you train a network, you compute a gradient that reduces the loss, and to do that you backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, because then the resulting gradient takes every sample into account. In practice, this is not possible due to the computational cost of calculating the gradient over all samples.
Therefore, we use a small batch_size as an approximation! The idea is that instead of considering all the samples, we compute the gradient on a small subset of samples, but as a trade-off we lose some information about the gradient.
Rule of thumb: smaller batch sizes give noisy gradients, but they converge faster because you get more updates per epoch. If your batch size is 1, you get N updates per epoch; if it is N, you get only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly and increase the computational cost per update.
That is why you observe varying losses/fluctuations with smaller batch sizes: the gradient is noisy.
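As a quick illustration of the rule of thumb, here is a minimal sketch of the updates-per-epoch arithmetic (the dataset size and batch sizes are made-up numbers, not taken from your setup):

```python
import math

# Hypothetical number of training samples, chosen only for illustration.
num_samples = 10_000

for batch_size in (1, 8, 32, num_samples):
    # Number of optimizer updates performed in one epoch.
    updates_per_epoch = math.ceil(num_samples / batch_size)
    print(f"batch_size={batch_size:>6} -> {updates_per_epoch:>6} updates per epoch")
```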
SOLUTION: Accumulated Gradients
If memory constraints prevent you from simply increasing the batch size, you can use gradient accumulation to combat the fluctuating loss. The loss and gradients are still computed for each mini-batch, but instead of updating the weights after every batch, the gradients are accumulated over consecutive batches, and the parameters are only updated with the cumulative gradient after a specified number of batches.
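In plain PyTorch the idea looks roughly like the sketch below; the tiny linear model, random dataset, and `accumulation_steps` value are stand-ins for illustration only, not your actual Longformer setup:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model and dataset (assumptions, for illustration only).
model = nn.Linear(16, 2)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

accumulation_steps = 8  # assumed value; effective batch size = 4 * 8 = 32

optimizer.zero_grad()
for step, (features, labels) in enumerate(dataloader):
    logits = model(features)
    loss = nn.functional.cross_entropy(logits, labels)

    # Scale the loss so the accumulated gradient matches the mean over the effective batch.
    (loss / accumulation_steps).backward()

    # Update the weights only once every `accumulation_steps` mini-batches.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```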
On this page from the documentation you can find how to apply it: https://huggingface.co/transformers/v1.2.0/examples.html
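With the Trainer class you are already using, gradient accumulation is controlled by a single training argument. A minimal sketch, with assumed hyperparameter values (only `learning_rate=1e-5` comes from your question; the rest are illustrative):

```python
from transformers import TrainingArguments

# Same Trainer setup you already have; only the arguments change.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,    # small batch that fits in memory (assumed value)
    gradient_accumulation_steps=16,   # effective batch size per device = 2 * 16 = 32
    learning_rate=1e-5,               # one of the two rates you tried
    num_train_epochs=3,               # assumed value
    logging_dir="./logs",             # TensorBoard logging, as in your setup
)

# Pass `training_args` to your existing Trainer(model=..., args=training_args, ...)
# call; nothing else in the fine-tuning script needs to change.
```

The effective batch size is per_device_train_batch_size * gradient_accumulation_steps (times the number of devices), so you can smooth the loss without needing more GPU memory.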