Task scheduling on NVIDIA GPUs

I have some questions about task scheduling on NVIDIA GPUs.

(1) If one warp in a block (CTA) has finished while other warps are still running, does that warp wait for the others to finish? In other words, do the threads in a block (CTA) release their resources only once every thread in the block has finished? I think this should be the case, since the threads in a block share shared memory and other resources, and these resources are allocated at CTA granularity.

(2) If all threads in a block (CTA) are stalled on a long-latency operation such as a global memory access, will the threads of a new CTA take over the resources, the way a CPU would switch tasks? In other words, once a block (CTA) has been dispatched to an SM (Streaming Multiprocessor), does it hold its resources until it has finished?

I would appreciate it if someone could recommend some books or articles about GPU architecture. Thanks!

asked May 25 '17 09:05

foxspy

1 Answer

The Compute Work Distributor schedules a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread-block-level resources such as shared memory are allocated. The allocation creates enough warps for all threads in the thread block. The resource manager assigns warps round robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, a register file, and execution units. Once a warp is allocated to a sub-partition, it remains on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore, the warp is restored to the same SM with the same warp-id.
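The admission and placement rules in the paragraph above can be sketched as a small host-side C++ model. This is only an illustration under stated assumptions: the struct names, the three resources tracked, and the per-thread register accounting are hypothetical simplifications, and real hardware has allocation granularities and limits that differ per architecture.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Hypothetical view of what is still free on one SM.
struct SmFree {
    int warps;         // free warp slots
    int registers;     // free 32-bit registers
    int shared_bytes;  // free shared memory
};

// Hypothetical description of what one thread block needs.
struct BlockNeeds {
    int threads;
    int regs_per_thread;
    int shared_bytes;
};

// Warps needed to cover every thread; a partial last warp still costs a full warp.
int warps_for_block(int threads) {
    assert(threads > 0);
    return (threads + kWarpSize - 1) / kWarpSize;
}

// A block is placed on an SM only if the whole block fits; it is never split.
bool block_fits(const SmFree& sm, const BlockNeeds& b) {
    const int warps = warps_for_block(b.threads);
    const int regs = warps * kWarpSize * b.regs_per_thread;  // simplified accounting
    return warps <= sm.warps && regs <= sm.registers &&
           b.shared_bytes <= sm.shared_bytes;
}

// Round-robin placement of a block's warps onto the SM sub-partitions.
// `first` is the sub-partition that receives warp 0 (assumed, not documented).
std::vector<int> assign_subpartitions(int warps, int num_subpartitions, int first = 0) {
    assert(num_subpartitions > 0);
    std::vector<int> placement(warps);
    for (int w = 0; w < warps; ++w) {
        placement[w] = (first + w) % num_subpartitions;
    }
    return placement;
}
```

For example, under this model a 100-thread block needs 4 warps, and with 4 sub-partitions a 6-warp block would land on sub-partitions 0, 1, 2, 3, 0, 1.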

When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp-level resources, which include the warp-id and the register file.

When all warps in a thread block have completed, the block-level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
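The two-level release described above (warp-level resources first, block-level resources only after the last warp) can be pictured with a toy C++ model. The type and member names are hypothetical, and the model tracks nothing beyond a count of live warps.

```cpp
#include <cassert>

// Toy model of one thread block resident on an SM (hypothetical names).
struct ResidentBlock {
    int live_warps;             // warps that still hold a warp-id and registers
    bool block_resources_held;  // shared memory, barriers, ...

    explicit ResidentBlock(int warps)
        : live_warps(warps), block_resources_held(true) {
        assert(warps > 0);
    }

    // Call when one warp has finished and its outstanding instructions have drained.
    // The warp's own resources are freed immediately; block-level resources are
    // freed only when the last warp retires.
    // Returns true if the block has completed (the SM would then notify the
    // Compute Work Distributor).
    bool retire_warp() {
        assert(live_warps > 0);
        --live_warps;  // warp-id and register file space go back to the SM
        if (live_warps == 0) {
            block_resources_held = false;
            return true;
        }
        return false;
    }
};
```

In this model a warp that finishes early does not keep its registers or warp slot while waiting for its siblings, but the block's shared memory stays allocated until the final `retire_warp()` call returns true.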

Once a warp is allocated to a sub-partition and all of its resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest-priority eligible warp and issues 1-2 consecutive instructions from that warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load, it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report as stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching between warps on a per-cycle basis.
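A rough way to see the latency-hiding effect is a single-issue toy scheduler in C++. Everything here is an assumption made for illustration: every instruction depends on the previous one in its warp with the same fixed latency, the lowest-numbered eligible warp stands in for "highest priority", and there is no dual-issue.

```cpp
#include <cassert>
#include <vector>

// Returns the number of cycles one toy warp scheduler needs to issue every
// instruction of every warp, issuing at most one instruction per cycle.
int cycles_to_issue_all(int num_warps, int instrs_per_warp, int latency) {
    assert(num_warps > 0 && instrs_per_warp > 0 && latency >= 1);
    std::vector<int> ready_at(num_warps, 0);                // cycle at which a warp is eligible again
    std::vector<int> remaining(num_warps, instrs_per_warp); // instructions left per warp
    int left = num_warps * instrs_per_warp;
    int cycle = 0;
    while (left > 0) {
        for (int w = 0; w < num_warps; ++w) {
            const bool eligible = remaining[w] > 0 && ready_at[w] <= cycle;
            if (eligible) {
                --remaining[w];
                --left;
                ready_at[w] = cycle + latency;  // stalled until the result is available
                break;                          // single issue per cycle
            }
        }
        ++cycle;  // a cycle with no eligible warp is simply an idle issue slot
    }
    return cycle;
}
```

With a latency of 4 cycles, one warp with 4 dependent instructions takes 13 cycles in this model because the scheduler idles while the warp is stalled, whereas 4 such warps issue all 16 instructions in 16 cycles because another warp is always eligible.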

This answer does not use the term CUDA core, as it introduces an incorrect mental model. CUDA cores are pipelined single-precision floating-point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM also has other execution units, including load/store units, double-precision floating-point units, half-precision floating-point units, branch units, etc.

answered Sep 24 '22 17:09

Greg Smith

Tags: cuda, gpgpu, gpu
