Starting with zero usage:
>>> import gc
>>> import GPUtil
>>> import torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 0% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Then I create a big enough tensor and hog the memory:
>>> x = torch.rand(10000,300,200).cuda()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
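For scale, a rough back-of-the-envelope estimate of that tensor's footprint (a sketch assuming the default float32 dtype, i.e. 4 bytes per element, ignoring allocator rounding):
>>> 10000 * 300 * 200 * 4 / 1024**3   # elements * bytes per element, in GiB
2.2351741790771484
So roughly 2.2 GiB of the ~26% shown above is the tensor itself; the rest is CUDA context and allocator overhead.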
Then I tried several ways to see if I could make the tensor's memory disappear.
Attempt 1: Detach, send to CPU and overwrite the variable
No, doesn't work.
>>> x = x.detach().cpu()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 2: Delete the variable
No, this doesn't work either
>>> del x
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
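A useful cross-check at this point (not part of the original session, just a diagnostic sketch, assuming nothing else in this process holds CUDA tensors) is to ask PyTorch itself what it is holding; torch.cuda.memory_reserved() is the newer name for torch.cuda.memory_cached() on older versions:
>>> torch.cuda.memory_allocated()      # bytes held by live tensors
0
>>> torch.cuda.memory_reserved() > 0   # bytes still parked in the caching allocator
True
So the tensor really is gone from PyTorch's point of view; its memory has just been returned to PyTorch's cache rather than to the driver, which is why nvidia-smi (and hence GPUtil) still shows it as used.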
Attempt 3: Use the torch.cuda.empty_cache() function
This seems to work, but some lingering overhead remains...
>>> torch.cuda.empty_cache()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
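To see where that lingering ~5% lives (again a sketch, not part of the original session): after empty_cache() PyTorch's own counters drop to zero, so whatever nvidia-smi still reports is held outside the caching allocator, presumably the CUDA context created on the first CUDA call:
>>> torch.cuda.memory_allocated()   # live tensors: none left
0
>>> torch.cuda.memory_reserved()    # allocator cache: released by empty_cache()
0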
Attempt 4: Maybe run the garbage collector.
No, 5% is still being hogged
>>> gc.collect()
0
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 5: Try deleting torch altogether (as if that would work when del x didn't work -_- )
No, it doesn't...
>>> del torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
And then I tried checking gc.get_objects(), and it looks like there's still quite a lot of odd THCTensor stuff inside...
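For reference, a scan like the following is a common way to list the tensors the garbage collector can still reach (a sketch adapted from the usual PyTorch-forum recipe; it won't see memory held purely by extension code or by the CUDA context):

import gc
import torch

# Print every tensor the garbage collector still tracks; any entry on a
# CUDA device is what keeps GPU memory allocated (as opposed to cached).
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj):
            print(type(obj), obj.device, tuple(obj.shape))
    except Exception:
        pass  # some tracked objects raise on attribute access; skip them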
Any idea why the memory is still in use after clearing the cache?
It looks like PyTorch's caching allocator reserves a fixed amount of memory even when there are no tensors, and this allocation is triggered by the first CUDA memory access (torch.cuda.empty_cache() deletes unused tensors from the cache, but the cache itself still uses some memory).
Even with a tiny 1-element tensor, after del and torch.cuda.empty_cache(), GPUtil.showUtilization(all=True) reports exactly the same amount of GPU memory used as for a huge tensor (and both torch.cuda.memory_cached() and torch.cuda.memory_allocated() return zero).
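A minimal way to reproduce that fixed overhead, as a sketch (the exact figure depends on the GPU, driver, and CUDA/PyTorch versions):

import torch

x = torch.zeros(1, device="cuda")     # first CUDA access: creates the CUDA context
del x
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated())  # 0: no live tensors
print(torch.cuda.memory_reserved())   # 0: nothing left in the caching allocator
# ...yet nvidia-smi / GPUtil.showUtilization() still reports a few hundred MB
# in use, the same amount as after deleting a multi-GB tensor.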
From the PyTorch docs:
Memory management
PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. **However, the unused memory managed by the allocator will still show as if used in nvidia-smi.** You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_cached() and max_memory_cached() to monitor memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
I bolded the part mentioning nvidia-smi, which as far as I know is what GPUtil uses.