What does "persistence mode" actually do which reduces CUDA startup time?

Tags: cuda

Starting up the CUDA runtime takes a certain amount of time to harmonize the UVM memory maps of the device and the host; see:

cudaGetCacheConfig takes 0.5 seconds - how/why?
slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice

Now, it's been suggested to me that using Persistence Mode would mitigate this phenomenon significantly. In what way? I mean, what will happen, or fail to happen, when persistence mode is on and a process using CUDA exits?

The documentation says:

Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.

but - what does "keeping initialized" mean? Later, the section about the persistence daemon (which is not the same thing as persistence mode) says:

The GPU state remains loaded in the driver whenever one or more clients have the device file open. Once all clients have closed the device file, the GPU state will be unloaded unless persistence mode is enabled.

So what exactly is unloaded? Where is it unloaded to? How does it relate to the memory size? And why would it take so much time to load it back if nothing significant has happened on the system?

asked Jul 27 '17 20:07

einpoklum

1 Answer

There are 2 major pieces to the GPU/CUDA start-up sequence:

Device initialization time
CUDA context "lazy" initialization

A modern CUDA GPU can exist in one of several power states. The current power state is observable via nvidia-smi or via NVML (note, however, that running a tool like nvidia-smi may itself change the power state of a GPU). When the GPU is not being used for any purpose (i.e. it is idle; technically, no contexts of any kind are instantiated on the GPU) and persistence mode is not enabled, the GPU, in concert with the GPU driver, will automatically reduce its power state to a very low level, sometimes including a complete power-off scenario.
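As a rough way to observe the state described above, here is a minimal Python sketch that asks nvidia-smi for each GPU's persistence mode and performance state. It assumes nvidia-smi is on the PATH and uses its `--query-gpu` fields `index`, `name`, `persistence_mode` and `pstate`; the helper names `parse_gpu_status` and `query_gpu_status` are made up for this illustration, and the parser assumes GPU names contain no commas. Keep in mind the caveat above: running the query may itself nudge the GPU into a different power state.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ("index", "name", "persistence_mode", "pstate")


def parse_gpu_status(csv_text):
    """Turn 'nvidia-smi --format=csv,noheader' output into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not have the expected number of fields
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpu_status():
    """Run nvidia-smi once and return one dict per visible GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_status(out)
```

On a machine with the NVIDIA driver installed, `for gpu in query_gpu_status(): print(gpu)` would list each device with its `persistence_mode` ("Enabled" or "Disabled") and its current `pstate`.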

The process of moving a GPU to a lower power state will involve shutting off or modifying the behavior of various pieces of hardware. Examples include reducing memory clocks, reducing core clocks, shutting off display output, shutting off the memory subsystem, shutting off various internal subsystems such as clock generators, shutting off major parts of the chip such as the compute cores and caches, and potentially even a "complete" power-down of the chip. A modern GPU has a controllable power delivery system, both on-chip and off-chip, to enable this behavior.

To reverse this process, the GPU driver software must carefully (in a prescribed sequence) power up modules, wait for a hardware settling time, then apply a module-level reset, then begin initializing controlling registers in the module. For example, powering up memory would involve, amongst other things, turning on the on-chip DRAM control module, turning on DRAM power, turning on the memory pin drivers, setting slew rates, turning on the memory clock, initializing the memory clock generator PLL for desired operation, and in many cases, initializing memory to some known state. For example, proper ECC usage requires that memory be initialized to a known state, which may not be simply all zeroes, but involves ECC tags which must be computed and stored. This "ECC Scrub" is one example of a "time-consuming" process mentioned in the documentation.

Depending on the exact power state, there may be any number of things that the driver must do to bring the GPU to the next higher power state (or "performance state"), P0 being the highest state. Once the perf state is above a certain level (say, P8) then the GPU may be capable of supporting certain types of contexts (e.g. a compute context) but perhaps at a reduced performance level (unless you are at P0).

These operations take time, and persistence mode will generally keep the GPU at power/perf state P2 or P0, meaning that essentially none of the above steps must be performed if it is desired that a context be opened on the GPU.

However, opening a GPU context may involve start-up costs of its own, that the GPU cannot or does not keep track of. For example, opening a compute context in a UVA regime requires, among other things, that "virtual allocations" be requested of the host OS, and that the memory maps of all processors in the system (all "visible" GPUs, plus the CPU) be "harmonized" so that everyone has a unique space to work in, and the numerical value of a 64-bit pointer in the space can be used to uniquely determine "ownership" or "meaning/introspection" of that pointer.

For the most part, activities related to opening a CUDA context (other than the process of bringing the device to a state where it can support a context) will not be impacted or benefitted by having the GPU in persistence mode.

Since both device initialization and CUDA context creation contribute to the perceived "CUDA startup time", persistence mode may reduce the overall perceived start-up time, but it cannot reduce it to zero, since some activities associated with context creation are outside of its purview.
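One way to check this on a given machine is to time fresh launches of a small CUDA program with persistence mode off and then on (it can be toggled with `nvidia-smi -pm 0` / `nvidia-smi -pm 1`, which normally needs root). The following is only a sketch of such a harness: `time_command` and `BASELINE_CMD` are names invented here, and `./my_cuda_app` in the usage note stands for whatever CUDA executable you want to measure. The harness measures whole-process wall-clock time, so it cannot by itself separate device initialization from context creation; the baseline command only gives a sense of plain process-launch overhead.

```python
import subprocess
import sys
import time

# A launch that involves no CUDA at all, for estimating process-spawn overhead.
BASELINE_CMD = [sys.executable, "-c", "pass"]


def time_command(cmd, runs=3):
    """Launch cmd as a fresh process `runs` times; return wall-clock seconds per launch."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings
```

For example, compare `time_command(["./my_cuda_app"])` against `time_command(BASELINE_CMD)` once with persistence mode disabled and once with it enabled. If the explanation above holds for your GPU and driver, the difference between the two settings should reflect mostly device initialization, while the part that remains with persistence mode on includes context creation.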

The exact behavior of persistence mode may vary over time and by GPU type. Recently, it seems that persistence mode may still allow GPUs to move down to a power state of P8.

answered Sep 30 '22 01:09

Robert Crovella
