The CUDA documentation does not specify how many CUDA processes can share one GPU. For example, if I launch more than one CUDA program as the same user on a system with only one GPU card installed, what is the effect? Will correctness of execution be guaranteed? How does the GPU schedule tasks in this case?
Multiple applications may run at the same time on the same GPU. That is, multiple applications can each hold a CUDA context at the same time and launch kernels, copy memory, and so on. However, kernels from different CUDA contexts cannot execute simultaneously on the same GPU.
The answer is: your applications can use every CUDA GPU you want. Multiple different graphics cards and multiple different GPUs can be handled by a CUDA application, as long as you manage them yourself. Check the CUDA FAQ, section "Hardware and Architecture", and the Multi-GPU slides, both official NVIDIA resources.
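A minimal sketch of what "managing them yourself" can look like with the CUDA runtime API (the loop body is a placeholder; real work would go where the comment indicates):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n", deviceCount);

    // Iterate over all visible devices and make each one current in turn.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);

        cudaSetDevice(dev);  // subsequent allocations/kernels target this GPU
        // ... allocate memory, launch kernels, etc., for this device here ...
    }
    return 0;
}
```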
To run multiple instances of a single-GPU application on different GPUs, you can use the CUDA environment variable CUDA_VISIBLE_DEVICES. The variable restricts execution to a specific set of devices. To use it, set CUDA_VISIBLE_DEVICES to a comma-separated list of GPU IDs before launching each instance.
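For illustration, a small sketch that only reports what the process is allowed to see (the binary name ./app in the usage note below is a hypothetical example):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // The process only sees the GPUs listed in CUDA_VISIBLE_DEVICES,
    // renumbered starting from 0.
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES=%s\n", visible ? visible : "(unset)");

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("This process sees %d device(s)\n", deviceCount);
    return 0;
}
```

Launching it twice, e.g. as CUDA_VISIBLE_DEVICES=0 ./app and CUDA_VISIBLE_DEVICES=1 ./app, pins each instance to a different physical GPU; inside each process the assigned GPU shows up as device 0.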
CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.
CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.)

Note that the inter-context switching/scheduling behavior is unspecified and may vary with the machine setup. Casual observation or micro-benchmarking may suggest that kernels from separate processes on newer devices can run concurrently (outside of MPS), but this is not correct. Newer machine setups may exhibit time-sliced rather than round-robin behavior, but that does not change the fact that, at any given instant in time, code from only one context can run.
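One rough way to see this effect (a sketch, not a rigorous benchmark; the grid size and iteration count are arbitrary choices): time a long-running kernel with CUDA events and compare the reported time when one instance runs alone versus when two instances of the same program run from separate shells.

```
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately long-running kernel so the overlap (or lack of it)
// between two processes is visible in the measured time.
__global__ void spin(float *out, int iters) {
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.000001f;
    out[threadIdx.x + blockIdx.x * blockDim.x] = x;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 1024 * 1024 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    spin<<<1024, 1024>>>(d_out, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.1f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```

Without MPS, the per-process kernel time typically grows when a second instance is started, because the GPU is switching between the two contexts rather than running both kernels at once.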
The "exception" to this case (serialization of GPU activity from independent host processes) would be the CUDA Multi-Process Server. In a nutshell, the MPS acts as a "funnel" to collect CUDA activity emanating from several host processes, and run that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use-case would be for launching multiple MPI ranks that all intend to use a single GPU resource.
Note that the above description applies to GPUs in the "Default" compute mode. GPUs in "Exclusive Process" compute mode (or the older "Exclusive Thread" mode) will reject attempts to create more than one process/context on a single device. In one of these modes, attempts by other processes to use a device already in use result in a CUDA API error. The compute mode can be changed in some cases using the nvidia-smi utility.
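A short sketch of how an application can inspect the compute mode it is subject to (device 0 is assumed; the enum values come from the CUDA runtime's cudaComputeMode):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // prop.computeMode indicates whether other processes may share this GPU.
    switch (prop.computeMode) {
        case cudaComputeModeDefault:
            printf("Default: multiple host processes/contexts allowed\n");
            break;
        case cudaComputeModeExclusiveProcess:
            printf("Exclusive Process: only one process may hold a context\n");
            break;
        case cudaComputeModeProhibited:
            printf("Prohibited: no contexts can be created on this device\n");
            break;
        default:
            printf("Other/legacy compute mode (%d)\n", prop.computeMode);
    }
    return 0;
}
```

The mode itself is set outside the application, typically by an administrator, with something like nvidia-smi -c EXCLUSIVE_PROCESS.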