Starting with zero usage:
>>> import gc
>>> import GPUtil
>>> import torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 0% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Then I create a big enough tensor and hog the memory:
>>> x = torch.rand(10000,300,200).cuda()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
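For scale, a rough back-of-the-envelope estimate of that tensor's footprint (a sketch assuming the default float32 dtype, i.e. 4 bytes per element, ignoring allocator rounding):
>>> 10000 * 300 * 200 * 4 / 1024**3   # elements * bytes per element, in GiB
2.2351741790771484
So roughly 2.2 GiB of the ~26% shown above is the tensor itself; the rest is CUDA context and allocator overhead.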
Then I tried several ways to see if I could make the tensor's memory disappear.
Attempt 1: Detach, send to CPU and overwrite the variable
No, doesn't work.
>>> x = x.detach().cpu()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 2: Delete the variable
No, this doesn't work either
>>> del x
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
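A useful cross-check at this point (not part of the original session, just a diagnostic sketch, assuming nothing else in this process holds CUDA tensors) is to ask PyTorch itself what it is holding; torch.cuda.memory_reserved() is the newer name for torch.cuda.memory_cached() on older versions:
>>> torch.cuda.memory_allocated()      # bytes held by live tensors
0
>>> torch.cuda.memory_reserved() > 0   # bytes still parked in the caching allocator
True
So the tensor really is gone from PyTorch's point of view; its memory has just been returned to PyTorch's cache rather than to the driver, which is why nvidia-smi (and hence GPUtil) still shows it as used.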
Attempt 3: Use the torch.cuda.empty_cache() function
This seems to work, but some lingering overhead remains...
>>> torch.cuda.empty_cache()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
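To see where that lingering ~5% lives (again a sketch, not part of the original session): after empty_cache() PyTorch's own counters drop to zero, so whatever nvidia-smi still reports is held outside the caching allocator, presumably the CUDA context created on the first CUDA call:
>>> torch.cuda.memory_allocated()   # live tensors: none left
0
>>> torch.cuda.memory_reserved()    # allocator cache: released by empty_cache()
0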
Attempt 4: Maybe run the garbage collector.
No, 5% is still being hogged
>>> gc.collect()
0
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 5: Try deleting torch altogether (as if that would work when del x didn't work -_- )
No, it doesn't...
>>> del torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
And then I tried checking gc.get_objects(), and it looks like there's still quite a lot of odd THCTensor stuff inside...
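For reference, a scan like the following is a common way to list the tensors the garbage collector can still reach (a sketch adapted from the usual PyTorch-forum recipe; it won't see memory held purely by extension code or by the CUDA context):

import gc
import torch

# Print every tensor the garbage collector still tracks; any entry on a
# CUDA device is what keeps GPU memory allocated (as opposed to cached).
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj):
            print(type(obj), obj.device, tuple(obj.shape))
    except Exception:
        pass  # some tracked objects raise on attribute access; skip them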
Any idea why the memory is still in use after clearing the cache?
It looks like PyTorch's caching allocator reserves a fixed amount of memory even when there are no tensors, and this allocation is triggered by the first CUDA memory access (torch.cuda.empty_cache() deletes unused tensors from the cache, but the cache itself still uses some memory).
Even with a tiny 1-element tensor, after del and torch.cuda.empty_cache(), GPUtil.showUtilization(all=True) reports exactly the same amount of GPU memory used as for a huge tensor (and both torch.cuda.memory_cached() and torch.cuda.memory_allocated() return zero).
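A minimal way to reproduce that fixed overhead, as a sketch (the exact figure depends on the GPU, driver, and CUDA/PyTorch versions):

import torch

x = torch.zeros(1, device="cuda")     # first CUDA access: creates the CUDA context
del x
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated())  # 0: no live tensors
print(torch.cuda.memory_reserved())   # 0: nothing left in the caching allocator
# ...yet nvidia-smi / GPUtil.showUtilization() still reports a few hundred MB
# in use, the same amount as after deleting a multi-GB tensor.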
From the PyTorch docs:
Memory management
PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. **However, the unused memory managed by the allocator will still show as if used in nvidia-smi.** You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_cached() and max_memory_cached() to monitor memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
I bolded the part mentioning nvidia-smi, which as far as I know is what GPUtil uses.