TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
The total consumed GPU memory = GPU memory for parameters x 2 (one copy for the values, one for the gradients) + GPU memory for storing the forward and backward responses. So the manual calculation would be 4 MB (for the input) + 64 MB x 2 (for the forward and backward responses) + well under 1 MB (for the parameters).
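As a quick sanity check, that estimate can be scripted. This is only a sketch: the layer shape below is a hypothetical 3x3 convolution over a 64 x 128 x 128 x 1 input batch, chosen so the numbers roughly match the 4 MB / 64 MB figures above, and it assumes 4-byte floats.

BYTES_PER_FLOAT = 4

def mb(n_floats):
    return n_floats * BYTES_PER_FLOAT / 1024.0 / 1024.0

batch, h, w = 64, 128, 128
c_in, c_out, k = 1, 16, 3                 # hypothetical 3x3 conv, 1 -> 16 channels

inputs    = batch * h * w * c_in          # ~4 MB of input activations
responses = batch * h * w * c_out         # ~64 MB forward, stored again for backward
params    = k * k * c_in * c_out          # well under 1 MB

total_mb = mb(inputs) + 2 * mb(responses) + 2 * mb(params)   # params x2: value + gradient
print('approx. %.1f MB' % total_mb)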
By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to use the relatively precious GPU memory resources on the devices more efficiently by reducing memory fragmentation. To limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method.
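For example, with the 1.x API used elsewhere in this thread, you can restrict which GPUs the process sees before creating a session (the device index 0 below is just an example):

import os
import tensorflow as tf

# Option 1: hide every GPU except device 0 from the process
# (must be set before TensorFlow initializes CUDA).
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Option 2: restrict visibility through the session configuration.
config = tf.ConfigProto(gpu_options=tf.GPUOptions(visible_device_list='0'))
sess = tf.Session(config=config)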
TensorFlow may also decide to store the gradients, in which case you have to take their memory usage into account as well. The way I do it is by setting the GPU memory limit to a fixed value, e.g. 1 GB, and then testing the model's inference speed.
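A sketch of that approach with the 1.x API: cap how much memory TensorFlow may allocate and then check whether the model still runs, and how fast. The fraction is relative to total device memory, so the 0.125 below assumes a hypothetical 8 GB card to get roughly 1 GB.

import tensorflow as tf

# Cap TensorFlow at ~1 GB (0.125 of a hypothetical 8 GB GPU).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.125)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

# ... build the model here and time repeated sess.run(...) calls
#     to measure inference speed under the memory cap ...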
But the GPU memory usage cannot be fully broken down per loaded model, because part of it goes to things like the CUDA context, which is shared among the loaded models.
This is a common situation we see: system memory is heavily used and memory usage seems to be gradually increasing, and as memory usage goes up, GPU utilization goes down. We also often see the network being the bottleneck when people try to train on datasets that aren't available locally.
(1) There is some limited support in the Timeline API for logging memory allocations. Here is an example of its usage:
import tensorflow as tf
from tensorflow.python.client import timeline

# Collect full trace metadata for one training step.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)

# Convert the step stats into a Chrome trace that includes memory allocation events.
tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
You can give this code a try with the MNIST example (mnist_with_summaries.py).
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that it only gives approximate GPU memory usage statistics: it basically simulates a GPU execution but doesn't have access to the full graph metadata, and it can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here: http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
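If you want the same nvidia-smi reading from inside a Python script, a small wrapper is enough. This is just a sketch and assumes nvidia-smi is on the PATH.

import subprocess

def gpu_memory_used_mb():
    # Current memory.used (in MB) for each visible GPU, as reported by nvidia-smi.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return [int(line) for line in out.decode().strip().splitlines()]

print(gpu_memory_used_mb())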
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
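Putting the three steps together, here is a minimal sketch; the large matmul is just a hypothetical stand-in for whatever part of your model you suspect.

import pdb
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True      # step 1: no pre-allocation
sess = tf.Session(config=config)

x = tf.random_normal([4096, 4096])          # hypothetical workload
y = tf.matmul(x, x)

pdb.set_trace()                             # step 3: step over the next line with 'n'
result = sess.run(y)                        # step 2: watch nvidia-smi -l in another terminal
print(result.shape)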