Fluctuating loss during training for text binary classification

I'm fine-tuning a Longformer on a binary document-classification task using the Hugging Face Trainer class, and I'm monitoring the metrics of some checkpoints with TensorBoard.

Even though the F1 score and accuracy are quite high, I'm puzzled by the fluctuations of the training loss.

I read online that the reason can be:

  • a learning rate that is too high, but I tried three values (1e-4, 1e-5 and 1e-6) and all of them had the same effect;
  • a batch size that is too small. I'm using a SageMaker p2.8xlarge notebook instance, which has 8 K80 GPUs. The largest batch size per GPU I can use without hitting a CUDA out-of-memory error is 1, so the total batch size is 8. My intuition is that a batch size of 8 is too small for a dataset of 57K examples (about 7K steps per epoch), but unfortunately it's the highest value I can use.

Here I have reported the trends of F1, accuracy, loss and smoothed loss. The grey line is the run with a learning rate of 1e-6, while the pink one is 1e-5.

To summarize all the info about my training (a rough sketch of my Trainer setup follows the list):

  • batch size: 1 x 8 GPUs = 8
  • learning rate: 1e-4, 1e-5, 1e-6 (all tested, none improved the loss)
  • model: Longformer
  • dataset:
    • training set: 57K examples
    • dev set: 12K examples
    • test set: 12K examples
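
For reference, this is roughly how I set up the Trainer. The checkpoint name, paths and logging values below are placeholders rather than my exact script, and train_dataset / dev_dataset stand in for my already-tokenized splits (construction not shown):

    from transformers import (
        LongformerForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model_name = "allenai/longformer-base-4096"  # placeholder checkpoint
    model = LongformerForSequenceClassification.from_pretrained(model_name, num_labels=2)

    training_args = TrainingArguments(
        output_dir="./results",            # placeholder output path
        per_device_train_batch_size=1,     # 1 per GPU x 8 K80s = total batch size 8
        per_device_eval_batch_size=1,
        learning_rate=1e-5,                # also tried 1e-4 and 1e-6
        logging_dir="./logs",              # TensorBoard logs
        logging_steps=50,                  # illustrative logging frequency
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,       # 57K tokenized examples (not shown)
        eval_dataset=dev_dataset,          # 12K tokenized examples (not shown)
    )
    trainer.train()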

What could be the reason? Should this be considered a problem despite the quite good F1 and accuracy results?

asked Nov 06 '22 by Paolo Magnani


1 Answer

I will first explain the reason for the fluctuations and then a possible way to reduce them.

REASON

When you train a network, you compute a gradient that reduces the loss; to do that, you backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, so that the resulting gradient captures every sample. In practice, this is not possible because of the computational cost of calculating the gradient over the entire dataset.

Therefore, we use a small batch_size as an approximation: instead of considering all the samples, we compute the gradient based on a small subset of them, and as a trade-off we lose some information about the true gradient.

Rule of thumb: smaller batch sizes give noisy gradients, but they converge faster because you get more updates per epoch. If your batch size is 1, you get N updates per epoch; if it is N, you get only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly and increase the computational cost per step.
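
To make that concrete with the numbers from this question (assuming roughly 57,000 training examples), here is a quick sketch of how the number of updates per epoch changes with the batch size:

    import math

    n_examples = 57_000  # size of the training set from the question

    for batch_size in (1, 8, 64, 57_000):
        updates_per_epoch = math.ceil(n_examples / batch_size)
        print(f"batch size {batch_size:>6}: {updates_per_epoch:>6} updates per epoch")

    # Output (annotations added):
    # batch size      1:  57000 updates per epoch   <- very noisy gradient
    # batch size      8:   7125 updates per epoch   <- the setting in the question
    # batch size     64:    891 updates per epoch
    # batch size  57000:      1 updates per epoch   <- full-batch gradient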

That is why, for smaller batch sizes, you observe fluctuating losses: the gradient is noisy.

SOLUTION: Accumulated Gradients

If memory is the limit, you can use gradient accumulation to combat the fluctuating loss. You still compute the loss and gradients for each mini-batch, but instead of updating the weights on every batch, you accumulate the gradients over consecutive batches and update the parameters with the cumulative gradient only after a specified number of batches.
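
Here is a minimal sketch of the idea in plain PyTorch. The tiny linear model and random data are just stand-ins so the example runs, and 8 accumulation steps is an arbitrary example value:

    import torch
    from torch import nn

    # Toy stand-ins so the sketch is runnable; replace with your real model and data.
    model = nn.Linear(10, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loss_fn = nn.CrossEntropyLoss()
    dataloader = [(torch.randn(1, 10), torch.randint(0, 2, (1,))) for _ in range(32)]

    accumulation_steps = 8  # accumulate 8 mini-batches of size 1 -> effective batch size 8

    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(dataloader):
        loss = loss_fn(model(inputs), labels) / accumulation_steps  # scale so the accumulated sum equals the mean
        loss.backward()                      # gradients add up in .grad; no parameter update yet

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                 # one update based on 8 accumulated mini-batches
            optimizer.zero_grad()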

This page of the documentation shows how to apply it: https://huggingface.co/transformers/v1.2.0/examples.html
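
With the Trainer class this is just one argument. A minimal sketch (the concrete values are illustrative, not a recommendation):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./results",           # placeholder path
        per_device_train_batch_size=1,    # 1 per GPU x 8 GPUs = 8 samples per forward pass
        gradient_accumulation_steps=8,    # update weights only every 8 steps
        learning_rate=1e-5,
    )

The effective batch size then becomes per_device_train_batch_size x number of GPUs x gradient_accumulation_steps (8 x 8 = 64 in this sketch), so the gradient is averaged over more samples and the loss curve should fluctuate less, without using any extra GPU memory.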

answered Nov 16 '22 by Berkay Berabi