I'm still a newbie in TensorFlow, and I'm trying to understand in detail what happens while my models train. Briefly, I'm using the slim models pretrained on ImageNet and fine-tuning them on my dataset. Here are some plots extracted from TensorBoard for 2 separate models:
Model_1 (InceptionResnet_V2)
Model_2 (InceptionV4)
So far, both models have poor results on the validation sets (Average Az (Area under the ROC curve) = 0.7 for Model_1 and 0.79 for Model_2). My interpretation of these plots is that the weights are not changing over the mini-batches; only the biases change, and this might be the problem. But I don't know where to look to verify this point. This is the only interpretation I can think of, but it might be wrong considering that I'm still a newbie. Can you please share your thoughts with me? Don't hesitate to ask for more plots if needed.
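One way I could imagine verifying this directly (a rough sketch, assuming the standard slim checkpoint layout in my train_dir; the paths are made up) is to diff the same variables across two saved checkpoints:

import numpy as np
import tensorflow as tf

# Hypothetical checkpoint paths: two snapshots from the same train_dir.
ckpt_early = '/tmp/train_dir/model.ckpt-1000'
ckpt_late = '/tmp/train_dir/model.ckpt-20000'

reader_early = tf.train.NewCheckpointReader(ckpt_early)
reader_late = tf.train.NewCheckpointReader(ckpt_late)

for name in sorted(reader_early.get_variable_to_shape_map()):
    # Only look at model parameters, not optimizer slots or global_step.
    if not (name.endswith('/weights') or name.endswith('/biases')):
        continue
    early = reader_early.get_tensor(name)
    late = reader_late.get_tensor(name)
    # Relative change of this variable between the two checkpoints.
    rel_change = np.linalg.norm(late - early) / (np.linalg.norm(early) + 1e-12)
    print('%s  %.6f' % (name, rel_change))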
EDIT: As you can see in the plots below, the weights seem to barely change over time. This applies to all the other weights in both networks. It led me to think that there is a problem somewhere, but I don't know how to interpret it.
InceptionV4 weights
InceptionResnetV2 weights
EDIT2: These models were first trained on ImageNet, and these plots are the results of fine-tuning them on my dataset. I'm using a dataset of 19 classes with roughly 800,000 images. I'm working on a multi-label classification problem, and I'm using sigmoid cross-entropy as the loss function (a minimal sketch of the loss follows the table below). The classes are highly unbalanced. The table below shows the percentage of presence of each class in the two subsets (train, validation):
Objects train validation
obj_1 3.9832 % 0.0000 %
obj_2 70.6678 % 33.3253 %
obj_3 89.9084 % 98.5371 %
obj_4 85.6781 % 81.4631 %
obj_5 92.7638 % 71.4327 %
obj_6 99.9690 % 100.0000 %
obj_7 90.5899 % 96.1605 %
obj_8 77.1223 % 91.8368 %
obj_9 94.6200 % 98.8323 %
obj_10 88.2051 % 95.0989 %
obj_11 3.8838 % 9.3670 %
obj_12 50.0131 % 24.8709 %
obj_13 0.0056 % 0.0000 %
obj_14 0.3237 % 0.0000 %
obj_15 61.3438 % 94.1573 %
obj_16 93.8729 % 98.1648 %
obj_17 93.8731 % 97.5094 %
obj_18 59.2404 % 70.1059 %
obj_19 8.5414 % 26.8762 %
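As mentioned above, here is a minimal sketch of the loss, assuming standard TF 1.x ops; the optional class_weights argument is only an idea for countering the imbalance, not something I currently use:

import tensorflow as tf

def multilabel_loss(logits, labels, class_weights=None):
    # logits: raw outputs of the final layer, shape [batch_size, 19].
    # labels: multi-hot ground truth, shape [batch_size, 19].
    per_element = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)
    if class_weights is not None:
        # Optional idea (not what I currently do): up-weight rare classes,
        # e.g. with a [19]-vector derived from the train percentages above.
        per_element = per_element * class_weights
    return tf.reduce_mean(per_element)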
The values of the hyperparams:
batch_size = 32
weight_decay = 0.00004          # The weight decay on the model weights.
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9             # Decay term for RMSProp.
learning_rate_decay_type = exponential  # Specifies how the learning rate is decayed.
learning_rate = 0.01            # Initial learning rate.
learning_rate_decay_factor = 0.94       # Learning rate decay factor.
num_epochs_per_decay = 2.0      # Number of epochs after which the learning rate decays.
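For clarity, this is roughly how I understand those flags translate into the learning-rate schedule and optimizer that slim builds (a sketch of my understanding, with slim's defaults assumed for anything not listed above):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# num_epochs_per_decay = 2.0 with ~800,000 training images and batch_size = 32.
decay_steps = int(800000 / 32 * 2.0)

# Exponential decay: lr = 0.01 * 0.94 ** (global_step // decay_steps).
learning_rate = tf.train.exponential_decay(
    learning_rate=0.01,
    global_step=global_step,
    decay_steps=decay_steps,
    decay_rate=0.94,
    staircase=True)

optimizer = tf.train.RMSPropOptimizer(
    learning_rate,
    decay=0.9,      # rmsprop_decay
    momentum=0.9,   # rmsprop_momentum
    epsilon=1.0)    # slim's default opt_epsilon (assumed, not listed above)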
Concerning the sparsity of the layers, here are some sample sparsity plots for both networks:
sparsity (InceptionResnet_V2)
sparsity (InceptionV4)
EDIT3: Here are the plots of the losses for both models:
Losses and regularization loss (InceptionResnet_V2)
Losses and regularization loss (InceptionV4)
The depth of the histogram indicates which values are new: the lighter/front slices are newer and the darker/far slices are older. Values are gathered into buckets, indicated by those triangular structures, and the x-axis indicates the range of values in which a bucket lies.
The TensorBoard Histogram Dashboard displays how the distribution of some Tensor in your TensorFlow graph has changed over time. It does this by showing many histogram visualizations of your tensor at different points in time.
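For completeness, a minimal sketch of how such histograms are typically produced in TF 1.x-style code (what to log and how often is up to you):

import tensorflow as tf

# One histogram summary per trainable variable, so weight/bias
# distributions show up in the Histograms (and Distributions) dashboards.
for var in tf.trainable_variables():
    tf.summary.histogram(var.op.name, var)

summary_op = tf.summary.merge_all()
writer = tf.summary.FileWriter('/tmp/train_dir')

# Inside the training loop, write summaries periodically, e.g.:
# writer.add_summary(sess.run(summary_op), global_step=step)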
I agree with your assessment - the weights aren't changing very much across the minibatches. It does appear they are changing somewhat.
As I'm sure you're aware, you are doing fine tuning with very large models. As such, backprop can sometimes take a while. But, you're running many training iterations. I don't really think this is the problem.
If I'm not mistaken, both of these were originally trained on ImageNet. If your images are in a totally different domain than something in ImageNet, that could explain the problem.
The backprop equations do make it easier for biases to change under certain activation ranges. ReLU can be one such case if the model is highly sparse (i.e. if many layers have activation values of 0, then weights will struggle to adjust but biases will not). Also, if activations are in the range [0, 1], the gradient with respect to a weight will be lower than the gradient with respect to a bias. (This is part of why sigmoid is a bad activation function.)
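To make that concrete, here is a toy single-neuron example (not from either of your models) showing how the weight gradient shrinks with the incoming activation while the bias gradient does not:

# Toy neuron: z = w * x + b, with downstream gradient dL/dz = delta.
# Then dL/dw = delta * x and dL/db = delta * 1.
delta = 1.0
for x in [1.0, 0.5, 0.1, 0.0]:  # incoming activations in [0, 1]
    grad_w = delta * x
    grad_b = delta
    print('x = %.1f  ->  dL/dw = %.2f,  dL/db = %.2f' % (x, grad_w, grad_b))
# The smaller the activation, the smaller the weight gradient; at x = 0
# (a dead ReLU input) the weight does not move at all, while the bias still does.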
It could also be related to your readout layer - specifically the activation function. How are you calculating error? Is this a classification or regression problem? If at all possible, I recommend using something other than sigmoid as your final activation function; tanh could be marginally better. A linear readout sometimes speeds up training, too: all the gradients have to "pass through" the readout layer, and if the derivative of the readout layer is always 1 (linear), you're "letting more gradient through" to adjust the weights further down the model.
Lastly, I notice your weight histograms are pushing towards negative weights. Sometimes, especially with models that have a lot of ReLU activation, that can be an indicator of the model learning sparsity. Or an indicator of the dead neuron problem. Or both - the two are somewhat linked.
Ultimately, I think your model is just struggling to learn. I've encountered very similar histograms retraining Inception. I was using a dataset of about 2000 images, and I was struggling to push it over 80% accuracy (as it happens, the dataset was heavily biased - that accuracy was roughly as good as random guessing). It helped when I made the convolution variables constant and only made changes to the fully connected layer.
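If you want to try that here, a rough sketch of restricting the optimizer to the readout variables; the scope names are my assumption based on the standard slim model definitions, so adjust them to your network:

import tensorflow as tf

def build_train_op(total_loss, learning_rate=0.001,
                   trainable_scopes=('InceptionV4/Logits',
                                     'InceptionV4/AuxLogits')):
    # Update only the variables under trainable_scopes; the pretrained
    # convolutional weights stay frozen at their ImageNet values.
    var_list = [v for v in tf.trainable_variables()
                if any(v.op.name.startswith(s) for s in trainable_scopes)]
    optimizer = tf.train.RMSPropOptimizer(learning_rate)
    return optimizer.minimize(total_loss, var_list=var_list)

slim's train_image_classifier.py exposes the same idea through its trainable_scopes flag.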
Indeed this is a classification problem, and sigmoid cross-entropy is an appropriate loss for multi-label classification (so a sigmoid readout makes sense here after all). And you do have a sizable dataset - certainly big enough to fine-tune these models.
With this new information, I would suggest lowering the initial learning rate. I have a two-fold reasoning here:
(1) is my own experience. As I mentioned, I'm not especially familiar with RMSprop. I've only used it in the context of DNCs (though, DNCs with convolutional controllers), but my experience there backs up what I'm about to say. I think 0.01 is high for training a model from scratch, let alone for fine-tuning. It's definitely high for Adam. In some sense, starting with a small learning rate is the "fine" part of fine-tuning. Don't force the weights to shift quite so much, especially if you're adjusting the whole model rather than the last (few) layer(s).
(2) is the increasing sparsity and shift toward negative weights. Based on your sparsity plots (good idea, by the way), it looks to me like some weights might be getting stuck in a sparse configuration as a result of overcorrection: with a high initial rate, the weights "overshoot" their optimal position and get stuck somewhere that makes it hard for them to recover and contribute to the model. Slightly negative and close to zero is not good in a ReLU network.
As I've mentioned (repeatedly), I'm not very familiar with RMSprop. But since you're already running lots of training iterations, give low, low, low initial rates a shot and work your way up. I mean, see how 1e-8 works. It's possible the model won't respond to training with a rate that low, but do something of an informal hyperparameter search with the learning rate. In my experience with Inception using Adam, 1e-4 to 1e-8 worked well.
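Informally, that search could look like the sketch below; train_and_evaluate is a hypothetical stand-in for your own training/evaluation loop (e.g. a wrapper around slim's scripts), not a real TensorFlow or slim function:

# Informal learning-rate sweep: a short run per candidate rate, then compare
# validation Az before committing to a long training run.
candidate_rates = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]

for lr in candidate_rates:
    # train_and_evaluate is a hypothetical helper wrapping your own training
    # and validation code; it is not part of TensorFlow or slim.
    metrics = train_and_evaluate(initial_learning_rate=lr, max_steps=5000)
    print('lr = %g  ->  validation Az = %.3f' % (lr, metrics['az']))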