
Sharing GPU memory between processes on the same GPU with PyTorch

I'm trying to implement an efficient way of doing concurrent inference in PyTorch.

Right now, I start 2 processes on my GPU (I have only 1 GPU, both processes are on the same device). Each process loads my PyTorch model and does the inference step.

My problem is that my model takes up quite a lot of memory. I have 12 GB of memory on the GPU, and the model alone takes ~3 GB (without the data). This means that together, my 2 processes take 6 GB of memory just for the model.


Now I was wondering if it's possible to load the model only once, and use this model for inference in 2 different processes. What I want is for only 3 GB of memory to be consumed by the model, while still having 2 processes.


I came across this answer mentioning IPC, but as far as I understood it, it means that process #2 will copy the model from process #1, so I will still end up with 6 GB allocated for the model.

I also checked the PyTorch documentation about DataParallel and DistributedDataParallel, but they don't seem to cover this.

This seems to be what I want, but I couldn't find any code example showing how to use it with PyTorch in inference mode.


I understand it might be difficult to do such a thing for training, but please note I'm only talking about the inference step (the model is read-only, no need to update gradients). With this assumption, I'm not sure whether it's possible or not.

asked Feb 05 '20 by Astariul

People also ask

Can two processes share GPU?

CUDA MPS is a feature that allows multiple CUDA processes to share a single GPU context; each process receives some subset of the available connections to that GPU. MPS allows overlapping of kernel and memcopy operations from different processes on the GPU to achieve maximum utilization.

Can I use shared memory GPU?

Yes. Because shared memory is essentially RAM, if a GPU has to resort to using the system's RAM for its computations, it will take a performance hit.

How does PyTorch allocate memory?

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as used in nvidia-smi.
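
For example, a quick way to see the difference between memory held by live tensors and memory cached by the allocator (which is what nvidia-smi reports) is something like the sketch below; the tensor size is arbitrary and this assumes a CUDA device is available:

    import torch

    x = torch.randn(1024, 1024, device="cuda")  # allocator requests a block from the driver
    print(torch.cuda.memory_allocated())        # bytes held by live tensors
    print(torch.cuda.memory_reserved())         # bytes cached by the allocator (what nvidia-smi sees)

    del x                                       # the tensor is freed...
    print(torch.cuda.memory_allocated())        # ...so this drops
    print(torch.cuda.memory_reserved())         # ...but the cached block usually remains

    torch.cuda.empty_cache()                    # hand unused cached blocks back to the driver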

Does torch use multiprocessing?

torch.multiprocessing is a drop-in replacement for Python's multiprocessing module. It supports the exact same operations, but extends it so that all tensors sent through a multiprocessing.Queue will have their data moved into shared memory and only a handle is sent to the other process.
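
For example, here is a minimal sketch of handing one CUDA-resident model to two inference worker processes with torch.multiprocessing; the nn.Linear is just a stand-in for the real model, and whether the weights are truly shared (via CUDA IPC) rather than copied depends on your platform and driver:

    import torch
    import torch.multiprocessing as mp


    def worker(rank, model):
        # The child receives the parent's CUDA tensors through CUDA IPC handles,
        # so (where IPC is supported) the weights are not duplicated on the device.
        model.eval()
        x = torch.randn(8, 128, device="cuda")  # dummy batch, shapes are arbitrary
        with torch.no_grad():
            y = model(x)
        print(f"worker {rank}: output shape {tuple(y.shape)}")


    if __name__ == "__main__":
        model = torch.nn.Linear(128, 10).cuda()  # stand-in for the real ~3 GB model
        mp.spawn(worker, args=(model,), nprocs=2, join=True)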


1 Answer

The GPU itself has many threads. When performing an array/tensor operation, it uses each thread on one or more cells of the array. This is why it seems that an op that can fully utilize the GPU should scale efficiently without multiple processes -- a single GPU kernel is already massively parallelized.

In a comment you mentioned seeing better results with multiple processes in a small benchmark. I'd suggest running the benchmark with more jobs to ensure warmup; ten kernels seems like too small a test. If you're finding a thorough, representative benchmark to run faster consistently though, I'll trust good benchmarks over my intuition.
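
For reference, a fair single-process benchmark should include warmup iterations and explicit synchronization, since CUDA kernels are launched asynchronously; a rough sketch (model and batch are placeholders):

    import time
    import torch

    def benchmark(model, batch, warmup=50, iters=500):
        model.eval()
        with torch.no_grad():
            for _ in range(warmup):       # warmup: allocator growth, cuDNN autotuning, etc.
                model(batch)
            torch.cuda.synchronize()      # make sure warmup kernels have finished
            start = time.perf_counter()
            for _ in range(iters):
                model(batch)
            torch.cuda.synchronize()      # wait for all timed kernels to complete
        return (time.perf_counter() - start) / iters  # average seconds per inference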

My understanding is that kernels launched on the default CUDA stream get executed sequentially. If you want them to run in parallel, I think you'd need multiple streams. Looking in the PyTorch code, I see code like getCurrentCUDAStream() in the kernels, which makes me think the GPU will still run any PyTorch code from all processes sequentially.
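
If you want to experiment with overlapping work inside a single process, PyTorch does expose streams directly; a rough sketch (whether you actually observe overlap depends on the GPU and the kernel sizes):

    import torch

    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    with torch.no_grad():
        with torch.cuda.stream(s1):
            out1 = a @ a              # enqueued on stream 1
        with torch.cuda.stream(s2):
            out2 = b @ b              # enqueued on stream 2, may overlap with stream 1

    torch.cuda.synchronize()          # wait for both streams before reading the results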

This NVIDIA discussion suggests this is correct:

https://devtalk.nvidia.com/default/topic/1028054/how-to-launch-cuda-kernel-in-different-processes/

Newer GPUs may be able to run multiple kernels in parallel (using MPS?), but it seems like this is just implemented with time slicing under the hood anyway, so I'm not sure we should expect higher total throughput:

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

If you do need to share memory from one model across two parallel inference calls, can you just use multiple threads instead of processes, and refer to the same model from both threads?
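
A minimal sketch of that thread-based approach (the nn.Linear is a placeholder for your model; the GIL is largely released while CUDA kernels run, so both threads can keep the GPU busy):

    import threading
    import torch

    model = torch.nn.Linear(128, 10).cuda().eval()  # loaded once, shared by both threads

    def infer(name, batch):
        with torch.no_grad():
            out = model(batch)                      # both threads read the same weights
        print(name, tuple(out.shape))

    batches = [torch.randn(8, 128, device="cuda") for _ in range(2)]
    threads = [threading.Thread(target=infer, args=(f"t{i}", b)) for i, b in enumerate(batches)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()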

To actually get the GPU to run multiple kernels in parallel, you may be able to use the torch.nn.parallel utilities in PyTorch. See the discussion here: https://discuss.pytorch.org/t/how-can-l-run-two-blocks-in-parallel/61618/3
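
If you want to try the approach from that thread, torch.nn.parallel.parallel_apply runs a list of modules on a list of inputs using one Python thread per module; note that on a single GPU the kernels may still end up serialized by the driver. The two Linear blocks below are just placeholders:

    import torch
    from torch.nn.parallel import parallel_apply

    block1 = torch.nn.Linear(128, 10).cuda()
    block2 = torch.nn.Linear(128, 10).cuda()
    inputs = [torch.randn(8, 128, device="cuda") for _ in range(2)]

    with torch.no_grad():
        out1, out2 = parallel_apply([block1, block2], inputs)  # one worker thread per module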

answered Sep 22 '22 by nairbv