 

Is there a way of determining how much GPU memory is in use by TensorFlow?

Tags:

tensorflow

gpu

TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?

asked Mar 21 '16 by Maarten




2 Answers

(1) There is some limited support with Timeline for logging memory allocations. Here is an example of its usage:

    from tensorflow.python.client import timeline  # needed for timeline.Timeline below

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    summary, _ = sess.run([merged, train_step],
                          feed_dict=feed_dict(True),
                          options=run_options,
                          run_metadata=run_metadata)
    train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
    train_writer.add_summary(summary, i)
    print('Adding run metadata for', i)
    # Convert the collected step stats into a Chrome trace that includes
    # memory allocation information.
    tl = timeline.Timeline(run_metadata.step_stats)
    print(tl.generate_chrome_trace_format(show_memory=True))
    trace_file = tf.gfile.Open(name='timeline', mode='w')
    trace_file.write(tl.generate_chrome_trace_format(show_memory=True))

You can give this code a try with the MNIST example (mnist with summaries).

This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It essentially simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.

(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.

nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.

Here is an example command: nvprof --print-gpu-trace matrixMul

And more details here: http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
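If you would rather log the coarse nvidia-smi numbers over time instead of checking them by hand, a small polling script can help. The sketch below is not part of the answer above; it assumes nvidia-smi is on your PATH and uses its --query-gpu CSV output, so treat it as a rough starting point:

    import subprocess
    import time

    def log_gpu_memory(interval_sec=1.0, iterations=10):
        # Poll nvidia-smi for the total device memory currently in use.
        # Note: this includes TensorFlow's pre-allocated pool, not just
        # the memory actually occupied by tensors.
        for _ in range(iterations):
            out = subprocess.check_output(
                ['nvidia-smi',
                 '--query-gpu=memory.used,memory.total',
                 '--format=csv,noheader'])
            print(out.decode().strip())
            time.sleep(interval_sec)

    log_gpu_memory()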

answered Sep 22 '22 by Yao Zhang


Here's a practical solution that worked well for me:

Disable GPU memory pre-allocation using TF session configuration:

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)

Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.

Step through your code with the debugger until you see the unexpected GPU memory consumption.
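As a complement to watching nvidia-smi while stepping through, some TensorFlow 1.x builds ship memory-statistics ops in tf.contrib.memory_stats that report allocator usage from inside the graph. This is not part of the answer above, and the contrib module may be missing from your version, so consider the following only as an optional sketch:

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True

    with tf.device('/device:GPU:0'):
        # Ops reporting the allocator's current and peak usage, in bytes,
        # on the device they are placed on.
        bytes_in_use = tf.contrib.memory_stats.BytesInUse()
        max_bytes_in_use = tf.contrib.memory_stats.MaxBytesInUse()

    with tf.Session(config=config) as sess:
        # Evaluate these ops right after the operations you want to inspect.
        print('Current bytes in use:', sess.run(bytes_in_use))
        print('Peak bytes in use:', sess.run(max_bytes_in_use))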

answered Sep 20 '22 by eitanrich