Expected tensorflow model size from learned variables

When training convolutional neural networks for image classification, we generally want the algorithm to learn the filters (and biases) that transform a given image into its correct label. I have a few models that I'm trying to compare in terms of model size, number of operations, accuracy, etc. However, the size of the model output by TensorFlow, concretely the model.ckpt.data file that stores the values of all the variables in the graph, is not what I expected. In fact, it seems to be three times bigger.

To go straight to the problem, I'm going to base my question on this Jupyter notebook. Below is the section where the variables (weights and biases) are defined:

# Store layers weight & bias
weights = {
    # 5x5 conv, 1 input, 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32]), dtype=tf.float32),
    # 5x5 conv, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64]), dtype=tf.float32),
    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024]), dtype=tf.float32),
    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, num_classes]), dtype=tf.float32)
}

biases = {
    'bc1': tf.Variable(tf.random_normal([32]), dtype=tf.float32),
    'bc2': tf.Variable(tf.random_normal([64]), dtype=tf.float32),
    'bd1': tf.Variable(tf.random_normal([1024]), dtype=tf.float32),
    'out': tf.Variable(tf.random_normal([num_classes]), dtype=tf.float32)
}

I've added a couple of lines in order to save the model at the end of the training process:

# Save the model (saver is the tf.train.Saver() created earlier in the notebook)
save_path = saver.save(sess, logdir + "model.ckpt")
print("Model saved in file: %s" % save_path)

Adding up all those variables, I would expect a model.ckpt.data file of about 12.45 MB (I obtained this by computing the number of float elements the model learns and converting that count to megabytes). But the .data file that gets saved is 39.3 MB. Why is this?
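For reference, this is roughly how I arrived at that estimate (a minimal sketch; it assumes num_classes = 10, as in the notebook, and 4 bytes per float32 parameter):

import numpy as np

# Rough size estimate from the variable shapes listed above.
num_classes = 10
shapes = [[5, 5, 1, 32], [5, 5, 32, 64], [7*7*64, 1024], [1024, num_classes],
          [32], [64], [1024], [num_classes]]
num_params = sum(int(np.prod(s)) for s in shapes)
print(num_params * 4 / 1024**2, "MB")  # ~12.5 MB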

I've followed the same approach with a more complex network (a variation of ResNet), and there too the actual .data file is roughly three times larger than my expected model size.

The data type of all these variables is float32.

asked Nov 15 '17 by karl71

1 Answer

Adding up all those variables, I would expect a model.ckpt.data file of about 12.45 MB

Traditionally, most of a model's parameters are in the first fully connected layer, in this case wd1. Computing only its size yields:

7*7*128 * 1024 * 4 = 25690112

... or 25.6 MB. Note the factor of 4: the variables have dtype=tf.float32, i.e. 4 bytes per parameter. Other layers also contribute to the model size, but not so drastically.

As you can see, your estimate of 12.45 MB is a bit off (did you assume 16 bits per parameter?). The checkpoint also stores some general information, hence an overhead of around 25%, which is still big, but not 300%.

[Update]

The model in question actually has an FC1 layer of shape [7*7*64, 1024], as was clarified. So the size calculated above should indeed be about 12.5 MB. That made me look into the saved checkpoint more carefully.

After inspecting it, I noticed other big variables that I missed originally:

...
Variable_2 (DT_FLOAT) [3136,1024]
Variable_2/Adam (DT_FLOAT) [3136,1024]
Variable_2/Adam_1 (DT_FLOAT) [3136,1024]
...

Variable_2 is exactly wd1, but there are two more copies of it for the Adam optimizer. These variables are created by the Adam optimizer; they are called slots and hold the m and v accumulators for all trainable variables. Now the total size makes sense.
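A listing like the one above can be produced, for example, with tf.train.list_variables (a sketch, reusing the logdir and checkpoint name from the question):

# Print the name and shape of every variable stored in the checkpoint.
for name, shape in tf.train.list_variables(logdir + "model.ckpt"):
    print(name, shape)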

You can run the following code to compute the total size of the graph variables, which comes out to 37.47 MB:

import numpy as np

# Sum (number of elements) * (bytes per element) over all global variables.
var_sizes = [np.product(list(map(int, v.shape))) * v.dtype.size
             for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)]
print(sum(var_sizes) / (1024 ** 2), 'MB')

So the checkpoint overhead itself is actually pretty small; the extra size comes from the optimizer.
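If you only want the learned weights and biases in the checkpoint (e.g. for deployment), one option is to pass just the trainable variables to the Saver. A minimal sketch, reusing sess and logdir from the question (the file name here is only illustrative):

# Save only trainable variables, so the Adam slot variables are not written.
weights_saver = tf.train.Saver(var_list=tf.trainable_variables())
save_path = weights_saver.save(sess, logdir + "model_weights.ckpt")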

answered Nov 14 '22 by Maxim