
How does TensorFlow allocate GPU memory when performing inference?

I am running FastRCNN w/ a ResNet50 architecture. I load the model checkpoint and do inference like this:

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})

Everything seems to be working great. The model takes 0.08s to actually perform inference.

But, I noticed that when I do this my GPU memory usage explodes to 15637MiB / 16280MiB according to nvidia-smi.

I found that you can use the option config.gpu_options.allow_growth to stop TensorFlow from allocating the entire GPU and instead use GPU memory as needed:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})

Doing this decreases memory usage to 4875MiB / 16280MiB. The model still takes 0.08s to run.

Finally, I tried the configuration below, which caps the allocation at a fixed fraction of GPU memory using per_process_gpu_memory_fraction.

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.05

saver = tf.train.Saver()
saver.restore(sess, 'model/model.ckpt')
with tf.Session(config=config) as sess:
    sess.run(y_pred, feed_dict={x: input_data})

Doing this brings usage down to 1331MiB / 16280MiB and the model still takes 0.08s to run.

This raises the question: how does TF allocate memory for models at inference time? If I want to load this model 10 times on the same GPU to perform inference in parallel, will that be an issue?

asked by farza
1 Answer

First, let's clarify what happens in tf.Session(config=config).

Creating the session submits the default graph definition to the TensorFlow runtime, and the runtime then allocates GPU memory accordingly.

By default, TensorFlow will allocate essentially all of the GPU memory unless you cap it with per_process_gpu_memory_fraction. The allocation fails if that amount cannot be reserved, unless you set gpu_options.allow_growth = True, which tells TF to start with a small region and extend it on demand instead of grabbing the full (or fractional) amount up front.
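
For example, a minimal sketch (assuming TF 1.x and the same graph and placeholders as in the question; the 0.1 fraction is an illustrative value, not from the post) that combines a hard cap with on-demand growth:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.1  # illustrative hard upper bound (~10% of the GPU)
config.gpu_options.allow_growth = True                    # start small and grow on demand, up to the cap

saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})

With both options set, the fraction acts as an upper bound while allow_growth keeps the session from reserving that bound before it is actually needed.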

So if you run 10 sessions and each one needs less than 1/10 of the GPU memory, it should work.
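
As a rough sketch of that idea (not from the original answer; the .meta path, the 0.09 fraction, and the tensor names are illustrative assumptions), each of the ten processes could get its own capped session:

# Sketch: ten inference workers, each in its own process with a session
# capped to just under 1/10 of the GPU.
import multiprocessing as mp
import tensorflow as tf

def worker(checkpoint_path):
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.09  # just under 1/10 of the GPU
    # Rebuild the graph from the checkpoint's .meta file (assumed to exist next to the checkpoint).
    saver = tf.train.import_meta_graph(checkpoint_path + '.meta')
    with tf.Session(config=config) as sess:
        saver.restore(sess, checkpoint_path)
        # Run inference here by tensor name, e.g.
        # sess.run('y_pred:0', feed_dict={'x:0': input_data})

if __name__ == '__main__':
    mp.set_start_method('spawn')  # avoid CUDA/fork issues in child processes
    procs = [mp.Process(target=worker, args=('model/model.ckpt',)) for _ in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()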

answered by pinxue