I'm still a newbie in TensorFlow, and I'm trying to understand in detail what happens while my models train. Briefly, I'm using the slim models pretrained on ImageNet and fine-tuning them on my dataset. Here are some plots extracted from TensorBoard for 2 separate models:
Model_1 (InceptionResnet_V2)
Model_2 (InceptionV4)
So far, both models have poor results on the validation sets (Average Az (Area under the ROC curve) = 0.7 for Model_1 and 0.79 for Model_2). My interpretation of these plots is that the weights are not changing over the mini-batches; only the biases change, and this might be the problem. But I don't know where to look to verify this point. This is the only interpretation I can think of, but it might be wrong considering that I'm still a newbie. Can you please share your thoughts with me? Don't hesitate to ask for more plots if needed.
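One way I could imagine verifying this directly (a rough sketch, assuming the standard slim checkpoint layout in my train_dir; the paths are made up) is to diff the same variables across two saved checkpoints:

import numpy as np
import tensorflow as tf

# Hypothetical checkpoint paths: two snapshots from the same train_dir.
ckpt_early = '/tmp/train_dir/model.ckpt-1000'
ckpt_late = '/tmp/train_dir/model.ckpt-20000'

reader_early = tf.train.NewCheckpointReader(ckpt_early)
reader_late = tf.train.NewCheckpointReader(ckpt_late)

for name in sorted(reader_early.get_variable_to_shape_map()):
    # Only look at model parameters, not optimizer slots or global_step.
    if not (name.endswith('/weights') or name.endswith('/biases')):
        continue
    early = reader_early.get_tensor(name)
    late = reader_late.get_tensor(name)
    # Relative change of this variable between the two checkpoints.
    rel_change = np.linalg.norm(late - early) / (np.linalg.norm(early) + 1e-12)
    print('%s  %.6f' % (name, rel_change))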
EDIT: As you can see in the plots below, the weights seem to barely change over time. This applies to all the other weights in both networks. It led me to think that there is a problem somewhere, but I don't know how to interpret it.
InceptionV4 weights
InceptionResnetV2 weights
EDIT2: These models were first trained on ImageNet, and these plots are the results of fine-tuning them on my dataset. I'm using a dataset of 19 classes with roughly 800,000 images. I'm working on a multi-label classification problem, and I'm using sigmoid cross-entropy as the loss function (a minimal sketch of the loss follows the table below). The classes are highly unbalanced. The table below shows the percentage of presence of each class in the two subsets (train, validation):
Objects train validation
obj_1 3.9832 % 0.0000 %
obj_2 70.6678 % 33.3253 %
obj_3 89.9084 % 98.5371 %
obj_4 85.6781 % 81.4631 %
obj_5 92.7638 % 71.4327 %
obj_6 99.9690 % 100.0000 %
obj_7 90.5899 % 96.1605 %
obj_8 77.1223 % 91.8368 %
obj_9 94.6200 % 98.8323 %
obj_10 88.2051 % 95.0989 %
obj_11 3.8838 % 9.3670 %
obj_12 50.0131 % 24.8709 %
obj_13 0.0056 % 0.0000 %
obj_14 0.3237 % 0.0000 %
obj_15 61.3438 % 94.1573 %
obj_16 93.8729 % 98.1648 %
obj_17 93.8731 % 97.5094 %
obj_18 59.2404 % 70.1059 %
obj_19 8.5414 % 26.8762 %
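As mentioned above, here is a minimal sketch of the loss, assuming standard TF 1.x ops; the optional class_weights argument is only an idea for countering the imbalance, not something I currently use:

import tensorflow as tf

def multilabel_loss(logits, labels, class_weights=None):
    # logits: raw outputs of the final layer, shape [batch_size, 19].
    # labels: multi-hot ground truth, shape [batch_size, 19].
    per_element = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)
    if class_weights is not None:
        # Optional idea (not what I currently do): up-weight rare classes,
        # e.g. with a [19]-vector derived from the train percentages above.
        per_element = per_element * class_weights
    return tf.reduce_mean(per_element)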
The values of the hyperparams:
batch_size = 32
weight_decay = 0.00004          # The weight decay on the model weights.
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9             # Decay term for RMSProp.
learning_rate_decay_type = exponential  # Specifies how the learning rate is decayed.
learning_rate = 0.01            # Initial learning rate.
learning_rate_decay_factor = 0.94       # Learning rate decay factor.
num_epochs_per_decay = 2.0      # Number of epochs after which the learning rate decays.
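For clarity, this is roughly how I understand those flags translate into the learning-rate schedule and optimizer that slim builds (a sketch of my understanding, with slim's defaults assumed for anything not listed above):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# num_epochs_per_decay = 2.0 with ~800,000 training images and batch_size = 32.
decay_steps = int(800000 / 32 * 2.0)

# Exponential decay: lr = 0.01 * 0.94 ** (global_step // decay_steps).
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,
    global_step=global_step,
    decay_steps=decay_steps,
    decay_rate=0.94,
    staircase=True)

optimizer = tf.train.RMSPropOptimizer(
    learning_rate,
    decay=0.9,      # rmsprop_decay
    momentum=0.9,   # rmsprop_momentum
    epsilon=1.0)    # slim's default opt_epsilon (assumed, not listed above)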
Concerning the sparsity of the layers, here are some sample sparsity plots for both networks:
sparsity (InceptionResnet_V2)
sparsity (InceptionV4)
EDIT3: Here are the plots of the losses for both models:
Losses and regularization loss (InceptionResnet_V2)
Losses and regularization loss (InceptionV4)
The depth of the histogram indicates which values are new: the lighter/front slices are newer and the darker/far slices are older. Values are gathered into buckets, indicated by those triangular structures, and the x-axis indicates the range of values in which a bucket lies.
The TensorBoard Histogram Dashboard displays how the distribution of some Tensor in your TensorFlow graph has changed over time. It does this by showing many histogram visualizations of your tensor at different points in time.
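For completeness, a minimal sketch of how such histograms are typically produced in TF 1.x-style code (what to log and how often is up to you):

import tensorflow as tf

# One histogram summary per trainable variable, so weight/bias
# distributions show up in the Histograms (and Distributions) dashboards.
for var in tf.trainable_variables():
    tf.summary.histogram(var.op.name, var)

summary_op = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/train_dir')

# Inside the training loop, write summaries periodically, e.g.:
# writer.add_summary(sess.run(summary_op), global_step=step)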
I agree with your assessment - the weights aren't changing very much across the minibatches. It does appear they are changing somewhat.
As I'm sure you're aware, you are doing fine tuning with very large models. As such, backprop can sometimes take a while. But, you're running many training iterations. I don't really think this is the problem.
If I'm not mistaken, both of these were originally trained on ImageNet. If your images are in a totally different domain than something in ImageNet, that could explain the problem.
The backprop equations do make it easier for biases to change under certain activation ranges. ReLU can be one such case if the model is highly sparse (i.e. if many layers have activation values of 0, then weights will struggle to adjust but biases will not). Also, if activations are in the range [0, 1], the gradient with respect to a weight will be lower than the gradient with respect to a bias. (This is part of why sigmoid is a bad activation function.)
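To make that concrete, here is a toy single-neuron example (not from either of your models) showing how the weight gradient shrinks with the incoming activation while the bias gradient does not:

# Toy neuron: z = w * x + b, with downstream gradient dL/dz = delta.
# Then dL/dw = delta * x and dL/db = delta * 1.
delta = 1.0
for x in [1.0, 0.5, 0.1, 0.0]:  # incoming activations in [0, 1]
    grad_w = delta * x
    grad_b = delta
    print('x = %.1f  ->  dL/dw = %.2f,  dL/db = %.2f' % (x, grad_w, grad_b))
# The smaller the activation, the smaller the weight gradient; at x = 0
# (a dead ReLU input) the weight does not move at all, while the bias still does.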
It could also be related to your readout layer - specifically the activation function. How are you calculating error? Is this a classification or regression problem? If at all possible, I recommend using something other than sigmoid as your final activation function; tanh could be marginally better. A linear readout sometimes speeds up training, too: all the gradients have to "pass through" the readout layer, and if the derivative of the readout layer is always 1 (linear), you're "letting more gradient through" to adjust the weights further down the model.
Lastly, I notice your weight histograms are pushing towards negative weights. Sometimes, especially with models that have a lot of ReLU activation, that can be an indicator of the model learning sparsity. Or an indicator of the dead neuron problem. Or both - the two are somewhat linked.
Ultimately, I think your model is just struggling to learn. I've encountered very similar histograms retraining Inception. I was using a dataset of about 2000 images, and I was struggling to push it over 80% accuracy (as it happens, the dataset was heavily biased - that accuracy was roughly as good as random guessing). It helped when I made the convolution variables constant and only made changes to the fully connected layer.
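If you want to try that here, a rough sketch of restricting the optimizer to the readout variables; the scope names are my assumption based on the standard slim model definitions, so adjust them to your network:

import tensorflow as tf

def build_train_op(total_loss, learning_rate=0.001,
                   trainable_scopes=('InceptionV4/Logits',
                                     'InceptionV4/AuxLogits')):
    # Update only the variables under trainable_scopes; the pretrained
    # convolutional weights stay frozen at their ImageNet values.
    var_list = [v for v in tf.trainable_variables()
                if any(v.op.name.startswith(s) for s in trainable_scopes)]
    optimizer = tf.train.RMSPropOptimizer(learning_rate)
    return optimizer.minimize(total_loss, var_list=var_list)

slim's train_image_classifier.py exposes the same idea through its trainable_scopes flag.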
Indeed this is a classification problem, and sigmoid cross-entropy is an appropriate loss for multi-label classification (so a sigmoid readout makes sense here after all). And you do have a sizable dataset - certainly big enough to fine-tune these models.
With this new information, I would suggest lowering the initial learning rate. I have a two-fold reasoning here:
(1) is my own experience. As I mentioned, I'm not especially familiar with RMSprop. I've only used it in the context of DNCs (though, DNCs with convolutional controllers), but my experience there backs up what I'm about to say. I think 0.01 is high for training a model from scratch, let alone for fine-tuning. It's definitely high for Adam. In some sense, starting with a small learning rate is the "fine" part of fine-tuning. Don't force the weights to shift quite so much, especially if you're adjusting the whole model rather than the last (few) layer(s).
(2) is the increasing sparsity and shift toward negative weights. Based on your sparsity plots (good idea, by the way), it looks to me like some weights might be getting stuck in a sparse configuration as a result of overcorrection: with a high initial rate, the weights "overshoot" their optimal position and get stuck somewhere that makes it hard for them to recover and contribute to the model. Slightly negative and close to zero is not good in a ReLU network.
As I've mentioned (repeatedly), I'm not very familiar with RMSprop. But since you're already running lots of training iterations, give low, low, low initial rates a shot and work your way up. I mean, see how 1e-8 works. It's possible the model won't respond to training with a rate that low, but do something of an informal hyperparameter search with the learning rate. In my experience with Inception using Adam, 1e-4 to 1e-8 worked well.
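Informally, that search could look like the sketch below; train_and_evaluate is a hypothetical stand-in for your own training/evaluation loop (e.g. a wrapper around slim's scripts), not a real TensorFlow or slim function:

# Informal learning-rate sweep: a short run per candidate rate, then compare
# validation Az before committing to a long training run.
candidate_rates = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]

for lr in candidate_rates:
    # train_and_evaluate is a hypothetical helper wrapping your own training
    # and validation code; it is not part of TensorFlow or slim.
    metrics = train_and_evaluate(initial_learning_rate=lr, max_steps=5000)
    print('lr = %g  ->  validation Az = %.3f' % (lr, metrics['az']))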