I have an application where I split the processing load among the GPUs on a user's system. Basically, there is one CPU thread per GPU that initiates a GPU processing interval when triggered periodically by the main application thread.
Consider the following image (generated using NVIDIA's CUDA profiler tool) for an example of a GPU processing interval -- here the application is using a single GPU.
As you can see a big portion of the GPU processing time is consumed by the two sorting operations and I am using the Thrust library for this (thrust::sort_by_key). Also, it looks like thrust::sort_by_key calls a few cudaMallocs under the hood before it starts the actual sort.
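For reference, the sorts in question are ordinary sort_by_key calls of roughly this shape (the container types and sizes here are only illustrative, not my actual data):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sort_step(thrust::device_vector<int>&   keys,
               thrust::device_vector<float>& vals)
{
  // Each call like this allocates and frees its own temporary device
  // storage internally, which is where the hidden cudaMallocs come from.
  thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
}
```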
Now consider the same processing interval where the application has spread the processing load over two GPUs:
In a perfect world you would expect the 2 GPU processing interval to be exactly half that of the single GPU (because each GPU is doing half the work). As you can see, this is not the case, partially because the cudaMallocs seem to take longer when they are called simultaneously (sometimes 2-3 times longer) due to some sort of contention issue. I don't see why this needs to be the case because the memory allocation spaces for the 2 GPUs are completely independent, so there shouldn't be a system-wide lock on cudaMalloc -- a per-GPU lock would be more reasonable.
To prove my hypothesis that the issue is with simultaneous cudaMalloc calls, I created a ridiculously simple program with two CPU threads (for each GPU) each calling cudaMalloc several times. I first ran this program so that the separate threads do not call cudaMalloc at the same time:
You see it takes ~175 microseconds per allocation. Next, I ran the program with the threads calling cudaMalloc simultaneously:
Here, each call took ~538 microseconds or 3 times longer than the previous case! Needless to say, this is slowing down my application tremendously and it stands to reason the issue would only get worse with more than 2 GPUs.
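The test program itself isn't reproduced here, but a minimal sketch of that kind of test might look like the following (the thread API, allocation size, and call count are assumptions, not the original code):

```cpp
#include <cuda_runtime.h>
#include <pthread.h>

// Each host thread targets its own GPU and issues a burst of cudaMalloc
// calls; these are the calls being timed/profiled.
void* alloc_loop(void* arg)
{
  int device = *static_cast<int*>(arg);
  cudaSetDevice(device);                // bind this thread to its own GPU

  const int    kAllocs = 20;            // assumed count
  const size_t kBytes  = 16 * 1024 * 1024;  // assumed size per allocation
  void* ptrs[kAllocs];

  for (int i = 0; i < kAllocs; ++i)
    cudaMalloc(&ptrs[i], kBytes);

  for (int i = 0; i < kAllocs; ++i)
    cudaFree(ptrs[i]);

  return 0;
}

int main()
{
  int ids[2] = {0, 1};
  pthread_t threads[2];

  // Launching both threads at once makes the cudaMalloc calls overlap;
  // staggering the launches reproduces the "non-simultaneous" case.
  for (int i = 0; i < 2; ++i)
    pthread_create(&threads[i], NULL, alloc_loop, &ids[i]);
  for (int i = 0; i < 2; ++i)
    pthread_join(threads[i], NULL);

  return 0;
}
```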
I have noticed this behavior on Linux and Windows. On Linux, I am using Nvidia driver version 319.60 and on Windows I am using the 327.23 version. I am using CUDA toolkit 5.5.
Possible Reason: I am using a GTX 690 in these tests. This card is basically 2 680-like GPUs housed in the same unit. This is the only "multi-GPU" setup I've run, so perhaps the cudaMalloc issue has something to do with some hardware dependence between the 690's 2 GPUs?
I will preface this with a disclaimer: I'm not privy to the internals of the NVIDIA driver, so this is somewhat speculative.
The slow down you are seeing is just driver level contention caused by competition from multiple threads calling device malloc simultaneously. Device memory allocation requires a number of OS system calls, as does driver level context switching. There is a non-trivial amount of latency in both operations. It is probable that the extra time you see when two threads try and allocate memory simultaneously is caused by the additional driver latency from switching from one device to another throughout the sequence of system calls required to allocate memory on both devices.
I can think of a few ways you should be able to mitigate this:

- You could write your own custom Thrust memory allocator for the device that works off a slab of memory allocated once during initialisation (a sketch of such an allocator is shown below). That would eliminate the hidden allocation cost inside each sort_by_key, but the effort of writing your own user memory manager is non trivial. On the other hand, it leaves the rest of your thrust code intact.
- You could switch to a sorting library that lets you manage the temporary device memory yourself, so the one-time cost of allocating it can be paid up front, outside the performance-critical intervals.

In multi-GPU CUBLAS based linear algebra codes I have written, I combined both ideas and wrote a standalone user space device memory manager which works off a one-time allocated device memory pool. I found that removing all of the overhead cost of intermediate device memory allocations yielded a useful speed up. Your use case might benefit from a similar strategy.
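To illustrate the first idea, here is a minimal sketch of a caching allocator threaded through Thrust's execution policy. The allocator type and pooling strategy are my own illustration (not the answerer's actual code), and it assumes the Thrust 1.7-style thrust::cuda::par(alloc) interface:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// A tiny caching allocator: cudaMalloc is only called the first time a
// block of a given size is requested; later requests reuse the cached
// block, so the sort itself triggers no new device allocations.
struct cached_allocator
{
  typedef char value_type;

  char* allocate(std::ptrdiff_t num_bytes)
  {
    char* result = 0;
    std::multimap<std::ptrdiff_t, char*>::iterator it = free_blocks.find(num_bytes);
    if (it != free_blocks.end())
    {
      result = it->second;                    // reuse a cached block
      free_blocks.erase(it);
    }
    else
    {
      cudaMalloc((void**)&result, num_bytes); // fall back to a real allocation
    }
    allocated_blocks.insert(std::make_pair(result, num_bytes));
    return result;
  }

  void deallocate(char* ptr, size_t)
  {
    // Return the block to the free list instead of cudaFree-ing it.
    std::map<char*, std::ptrdiff_t>::iterator it = allocated_blocks.find(ptr);
    free_blocks.insert(std::make_pair(it->second, ptr));
    allocated_blocks.erase(it);
  }

  std::multimap<std::ptrdiff_t, char*> free_blocks;
  std::map<char*, std::ptrdiff_t>      allocated_blocks;
};

int main()
{
  thrust::device_vector<int>   keys(1 << 20);
  thrust::device_vector<float> vals(1 << 20);
  cached_allocator alloc;

  // The allocator is threaded through the execution policy, so the
  // temporary storage sort_by_key needs is drawn from the cache.
  thrust::sort_by_key(thrust::cuda::par(alloc),
                      keys.begin(), keys.end(), vals.begin());
  return 0;
}
```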
To summarize the problem and give a possible solution:
The cudaMalloc contention probably stems from driver level contention (possibly due to the need to switch device contexts, as talonmies suggested) and one could avoid this extra latency in performance-critical sections by cudaMalloc-ing temporary buffers beforehand.
It looks like I probably need to refactor my code so that I am not calling any sorting routine that calls cudaMalloc under the hood (in my case thrust::sort_by_key). The CUB library looks promising in this regard. As a bonus, CUB also exposes a CUDA stream parameter to the user, which could also serve to boost performance.
See "CUB (CUDA UnBound) equivalent of thrust::gather" for some details on moving from thrust to CUB.
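For reference, the CUB pattern looks roughly like this. The key/value types and array names are placeholders; the important part is that the single cudaMalloc for temporary storage can be moved out of the processing interval entirely and the sort launched on a stream of your choosing:

```cpp
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sketch of CUB's two-phase pattern, assuming int keys and float values.
void example(int*   d_keys_in,   int*   d_keys_out,
             float* d_values_in, float* d_values_out,
             int num_items, cudaStream_t stream)
{
  void*  d_temp_storage     = NULL;
  size_t temp_storage_bytes = 0;

  // Phase 1: with d_temp_storage == NULL, CUB only reports how much
  // temporary storage it needs; nothing is allocated or sorted yet.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_keys_in, d_keys_out,
                                  d_values_in, d_values_out,
                                  num_items, 0, sizeof(int) * 8, stream);

  // One-time allocation; in the real application this would live in the
  // per-GPU setup code rather than inside the processing interval.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Phase 2: the actual sort, launched on the caller's stream with no
  // hidden cudaMalloc calls.
  cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                  d_keys_in, d_keys_out,
                                  d_values_in, d_values_out,
                                  num_items, 0, sizeof(int) * 8, stream);

  cudaFree(d_temp_storage);
}
```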
UPDATE:
I backed out the calls to thrust::sort_by_key in favor of cub::DeviceRadixSort::SortPairs.
Doing this shaved milliseconds off my per-interval processing time. Also, the multi-GPU contention issue has resolved itself -- offloading to 2 GPUs now drops the processing time by almost 50%, as expected.