
Why does my training loss have regular spikes?

I'm training the Keras object detection model linked at the bottom of this question, although I believe my problem has to do neither with Keras nor with the specific model I'm trying to train (SSD), but rather with the way the data is passed to the model during training.

Here is my problem (see image below): My training loss is decreasing overall, but it shows sharp regular spikes:

[Image: training loss curve showing regular spikes]

The unit on the x-axis is not training epochs, but tens of training steps. The spikes occur precisely once every 1390 training steps, which is exactly the number of training steps for one full pass over my training dataset.

The fact that the spikes always occur after each full pass over the training dataset makes me suspect that the problem is not with the model itself, but with the data it is being fed during the training.

I'm using the batch generator provided in the repository to generate batches during training. I checked the source code of the generator and it does shuffle the training dataset before each pass using sklearn.utils.shuffle.

I'm confused for two reasons:

  1. The training dataset is being shuffled before each pass.
  2. As you can see in this Jupyter notebook, I'm using the generator's ad hoc data augmentation features, so the dataset should theoretically never be the same for any two passes: all the augmentations are random.

I made some test predictions to see if the model is actually learning anything, and it is! The predictions get better over time, but of course the model is learning very slowly since those spikes seem to mess up the gradient every 1390 steps.

Any hints as to what this might be are greatly appreciated! I'm using the exact same Jupyter notebook that is linked above for my training, the only variable I changed is the batch size from 32 to 16. Other than that, the linked notebook contains the exact training process I'm following.

Here is a link to the repository that contains the model:

https://github.com/pierluigiferrari/ssd_keras

Asked Dec 15 '17 by Alex

People also ask

How do you create a smooth loss curve?

You can use the Exponential Moving Average (EMA) method. This is the method TensorBoard uses to smooth loss curve plots. There is a small caveat with doing it this way, though: the running average S_t is initialized with the first raw value, which makes the start of the smoothed curve inaccurate.
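
A minimal sketch of this smoothing in Python; the weight of 0.9 and the raw loss values are made up for illustration, not taken from the question:

def smooth(values, weight=0.9):
    # Exponential moving average: S_t = weight * S_{t-1} + (1 - weight) * Y_t,
    # initialized with the first raw value, which is why the very start of
    # the smoothed curve is biased toward that value.
    smoothed = []
    s = values[0]
    for v in values:
        s = weight * s + (1 - weight) * v
        smoothed.append(s)
    return smoothed

raw_losses = [5.0, 4.2, 6.1, 3.9, 3.5, 5.8, 3.1]
print(smooth(raw_losses))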

What does training loss mean?

The training loss is a metric used to assess how well a deep learning model fits the training data. That is to say, it measures the error of the model on the training set. Note that the training set is the portion of the dataset used to train the model.

What is gradient clipping?

Gradient clipping is a method in which the error derivatives are clipped to a threshold during backpropagation through the network, and the clipped gradients are then used to update the weights.
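
In Keras, for instance, clipping can be requested directly on the optimizer via its clipnorm or clipvalue arguments (the learning rate and thresholds below are illustrative, not taken from the question):

from tensorflow.keras.optimizers import SGD

# Clip the norm of each gradient tensor to at most 1.0 ...
opt_clipnorm = SGD(learning_rate=0.001, momentum=0.9, clipnorm=1.0)

# ... or clip every gradient element to the range [-0.5, 0.5].
opt_clipvalue = SGD(learning_rate=0.001, momentum=0.9, clipvalue=0.5)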

Why does training loss fluctuate?

There are several reasons that fluctuations in training loss can occur over epochs. The main one, though, is that almost all neural nets are trained with some variant of gradient descent (SGD, Adam, etc.), which causes oscillations during the descent.


2 Answers

I've figured it out myself:

TL;DR:

Make sure your loss magnitude is independent of your mini-batch size.

The long explanation:

In my case the issue was Keras-specific after all.

Maybe the solution to this problem will be useful for someone at some point.

It turns out that Keras divides the loss by the mini-batch size. The important thing to understand here is that it's not the loss function itself that averages over the batch size; rather, the averaging happens somewhere else in the training process.

Why does this matter?

The model I am training, SSD, uses a rather complicated multi-task loss function that does its own averaging (not by the batch size, but by the number of ground truth bounding boxes in the batch). Now if the loss function already divides the loss by some number that is correlated with the batch size, and afterwards Keras divides by the batch size a second time, then all of a sudden the magnitude of the loss value starts to depend on the batch size (to be precise, it becomes inversely proportional to the batch size).

Now usually the number of samples in your dataset is not an integer multiple of the batch size you choose, so the very last mini-batch of an epoch (here I implicitly define an epoch as one full pass over the dataset) will end up containing fewer samples than the batch size. This is what messes up the magnitude of the loss if it depends on the batch size, and in turn messes up the magnitude of the gradient. Since I'm using an optimizer with momentum, that messed-up gradient continues to influence the gradients of a few subsequent training steps, too.
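
A back-of-the-envelope illustration of that effect (the dataset size below is hypothetical, merely chosen so that batch size 16 gives 1390 steps per pass, matching the question):

num_samples = 22230              # hypothetical dataset size
batch_size = 16

steps_per_epoch = -(-num_samples // batch_size)      # ceiling division -> 1390 steps
last_batch = num_samples % batch_size or batch_size  # 6 samples in the final batch

# If the reported loss is inversely proportional to the batch size, the last
# step of every pass over the data yields a loss roughly this many times too
# large -- a spike once per epoch:
print(steps_per_epoch, last_batch, batch_size / last_batch)   # 1390 6 ~2.67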

Once I adjusted the loss function by multiplying the loss by the batch size (thus reverting Keras' subsequent division by the batch size), everything was fine: No more spikes in the loss.
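
As a hedged sketch of that fix (this is not the actual ssd_keras code; it simply assumes, as described above, that Keras divides whatever the loss function returns by the mini-batch size), a loss that already normalizes itself can be wrapped like this:

import tensorflow as tf

def make_batch_size_independent(loss_fn):
    # loss_fn is assumed to do its own normalization (e.g. by the number of
    # ground truth boxes); multiplying by the batch size cancels Keras'
    # subsequent division by the batch size.
    def wrapped(y_true, y_pred):
        batch_size = tf.cast(tf.shape(y_pred)[0], tf.float32)
        return loss_fn(y_true, y_pred) * batch_size
    return wrapped

# Usage ('ssd_loss' is a placeholder name, not the repository's actual symbol):
# model.compile(optimizer=sgd, loss=make_batch_size_independent(ssd_loss))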

Answered by Alex

For anyone working in PyTorch, an easy solution to this specific problem is to tell the DataLoader to drop the last (incomplete) batch:

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=batch_size,
    shuffle=False,
    pin_memory=torch.cuda.is_available(),
    num_workers=num_workers,
    drop_last=True,
)
Answered by wprins