 

Tensorflow ResNet model loading uses **~5 GB of RAM** - while loading from weights uses only ~200 MB

I trained a ResNet50 model with TensorFlow 2.0 using transfer learning. I slightly modified the architecture (new classification layer) and saved the model during training with the ModelCheckpoint callback (https://keras.io/callbacks/#modelcheckpoint). Training went fine. The model saved by the callback takes ~206 MB on the hard drive.
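For context, a minimal sketch of the kind of training setup described above; the file names, the size of the new classification layer, and the optimizer are assumptions, not the exact code used:

import tensorflow as tf

# ResNet50 backbone with a new classification head (sizes are assumptions)
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')
outputs = tf.keras.layers.Dense(10, activation='softmax')(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# ModelCheckpoint writes the full model (architecture + weights + optimizer state) to HDF5
checkpoint = tf.keras.callbacks.ModelCheckpoint('my_model.hdf5', save_best_only=True)
# model.fit(train_data, validation_data=val_data, epochs=10, callbacks=[checkpoint])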

To predict using the model I did:

  • I started a JupyterLab notebook and loaded the model with my_model = tf.keras.models.load_model('../models_using/my_model.hdf5'). (Incidentally, the same thing happens in IPython.)

  • I used the free Linux command-line tool to measure the available RAM just before and just after loading. Loading the model takes about 5 GB of RAM.

  • I saved the model's weights and its config as JSON. Together they take about 105 MB on the hard drive.

  • I loaded the model from the JSON config and the weights. This takes about 200 MB of RAM.

  • I compared the predictions of both models: they are exactly the same (see the sketch after this list).

  • I tested the same procedure with a slightly different architecture (trained the same way) and the results were the same.
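For reference, a sketch of the two loading paths and the prediction check; file names and the dummy input are assumptions:

import numpy as np
import tensorflow as tf

# Path 1: full checkpoint written by ModelCheckpoint (~206 MB on disk, ~5 GB of RAM)
model_full = tf.keras.models.load_model('../models_using/my_model.hdf5')

# Path 2: JSON config + weights only (~105 MB on disk, ~200 MB of RAM)
with open('model_json.json') as f:
    model_light = tf.keras.models.model_from_json(f.read())
model_light.load_weights('model_weights.h5')

# Both give exactly the same predictions
x = np.random.rand(1, 224, 224, 3).astype('float32')  # dummy ResNet50-sized input
print(np.allclose(model_full.predict(x), model_light.predict(x)))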

Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?

By the way, given a model in Keras, can you find out the compilation procedure (optimizer, ...)? Model.summary() does not help.
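For reference, a model loaded with compile=True does expose its compile settings through attributes rather than through summary(); a minimal sketch, using the same path as above:

import tensorflow as tf

model = tf.keras.models.load_model('../models_using/my_model.hdf5')  # compile=True by default
print(type(model.optimizer).__name__)  # optimizer class, e.g. Adam
print(model.optimizer.get_config())    # learning rate and other hyperparameters
print(model.loss)                      # loss set at compile time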

2019-12-07 EDIT: Thanks to the answer below, I conducted a series of tests:

I used the !free command in JupyterLab to measure the available memory before and after each test. Since get_weights returns a list of NumPy arrays, I used copy.deepcopy to make a real copy of the objects. Note that the commands below were run in separate Jupyter cells, and the memory comments were added just for this post.

# imports assumed by the cells below
import copy
import tensorflow as tf

!free

model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786736 KB ≈ 3786.7 MB used

!free

weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51616 KB ≈ 51.6 MB used

!free

optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100868 KB ≈ 100.9 MB used

!free

model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199228 KB ≈ 199.2 MB used

!free

Loading the model from JSON:

!free
# loading from json
with open('model_json.json') as f:
    model_json_weights = tf.keras.models.model_from_json(f.read())

model_json_weights.load_weights('model_weights.h5')
!free

# 21132664 - 20971616 = 161048 KB ≈ 161.0 MB used
Asked Dec 06 '19 by Michael S

1 Answer

The difference between checkpoint and JSON+Weights is in the optimizer:

  • The checkpoint or model.save() saves the optimizer and its weights (and load_model compiles the model, restoring that state)
  • JSON + weights does not save the optimizer

Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a tensor of "momentum" for each weight tensor, for instance).

Some optimizers might take twice the size of the model, because they keep two tensors of optimizer weights for each tensor of model weights (Adam, for instance, stores a first- and a second-moment estimate per variable).
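A quick way to see this is to count the weight tensors on both sides; a minimal sketch, assuming the same model.hdf5 checkpoint as in the tests above:

import tensorflow as tf

model = tf.keras.models.load_model('model.hdf5', compile=True)

print(len(model.get_weights()))            # number of model weight tensors
print(len(model.optimizer.get_weights()))  # roughly twice that for Adam (plus an iteration counter)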

Saving and loading the optimizer is important if you want to continue training. Resuming training with a freshly initialized optimizer, without its accumulated state, will more or less destroy the model's performance (at least at the beginning).
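To properly continue training, the model would be loaded with its optimizer state intact; a short sketch (paths and epoch numbers are assumptions):

import tensorflow as tf

# compile=True (the default) restores the optimizer together with its weights
model = tf.keras.models.load_model('model.hdf5', compile=True)
# model.fit(train_data, initial_epoch=10, epochs=20)  # resumes with the accumulated optimizer state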

Now, the 5 GB is not really clear to me, but I suppose that:

  • There is probably a lot of compression in the saved weights
  • It might also have to do with allocating memory for all the gradient and backpropagation operations

Interesting tests:

  • Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These come back as NumPy arrays, copied from the original tensors (see the sketch after this list)
  • Gradient/backpropagation: check how much memory is used by:
    • load_model(name, compile=True)
    • load_model(name, compile=False)
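A minimal sketch of both tests, assuming the same model.hdf5 file; nbytes gives the in-memory size of the NumPy copies, which can be compared with the (possibly compressed) file on disk:

import os
import tensorflow as tf

model = tf.keras.models.load_model('model.hdf5', compile=True)

weight_mb = sum(w.nbytes for w in model.get_weights()) / 2**20
optimizer_mb = sum(w.nbytes for w in model.optimizer.get_weights()) / 2**20
file_mb = os.path.getsize('model.hdf5') / 2**20

print(f"model weights in memory:     {weight_mb:.1f} MB")
print(f"optimizer weights in memory: {optimizer_mb:.1f} MB")
print(f"checkpoint file on disk:     {file_mb:.1f} MB")  # smaller if the HDF5 file is compressed

# Gradient/backprop test: compare the memory used by each variant (e.g. with !free)
model_compiled = tf.keras.models.load_model('model.hdf5', compile=True)
model_uncompiled = tf.keras.models.load_model('model.hdf5', compile=False)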
Answered by Daniel Möller