Do NVIDIA GPUs support out-of-order execution?
My first guess is that they don't contain such expensive hardware. However, when reading the CUDA programming guide, the guide recommends using Instruction-Level Parallelism (ILP) to improve performance.
Isn't ILP a feature that hardware with out-of-order execution can take advantage of? Or does NVIDIA's ILP simply mean compiler-level reordering of instructions, so the order is still fixed at runtime? In other words, does the compiler and/or programmer just have to arrange the instructions in such a way that ILP can be achieved at runtime through in-order execution?
A typical ILP implementation allows multiple-cycle operations to be pipelined. For example, suppose four operations can be issued in a single clock cycle. Then the ILP execution hardware has four functional units, each attached to one of the operations, plus a branch unit and a common register file.
With ILP there is still a single thread of execution within a process. Concurrency, on the other hand, involves assigning multiple threads to a CPU core in strict alternation, or running them in true parallelism if there are enough cores, ideally one core per runnable thread.
To exploit ILP, parallelism or pipelining is used. The basic idea of pipelining is to work on multiple instructions in one clock cycle, which is only possible if there are no dependencies between the instructions.
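To make the dependency point concrete, here is a minimal sketch in CUDA device code (hypothetical function and variable names) contrasting a serial dependency chain with independent operations:

```cuda
// Serial dependency chain: each multiply consumes the previous
// result, so an in-order pipeline must wait at every step -- no ILP.
__device__ float dependent_chain(float x, float y, float z, float w)
{
    float a = x * y;   // a depends on x, y
    float b = a * z;   // b must wait for a
    float c = b * w;   // c must wait for b
    return c;
}

// Independent operations: none of the three multiplies reads
// another's result, so an in-order pipeline can overlap their
// execution -- ILP without any reordering hardware.
__device__ float independent_ops(float x, float y, float z,
                                 float w, float u, float v)
{
    float p = x * y;
    float q = z * w;   // independent of p
    float r = u * v;   // independent of p and q
    return p + q + r;
}
```

The compiler (or the programmer, by writing code this way) produces the second pattern, and the in-order pipeline does the rest.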
Pipelining is a common ILP technique and is certainly implemented on NVIDIA's GPUs. I think you will agree that pipelining doesn't rely on out-of-order execution. In addition, NVIDIA GPUs have multiple warp schedulers per SM from compute capability 2.0 onward (2 or 4). If your code has two (or more) consecutive, independent instructions in a thread (or the compiler reorders it that way), the schedulers exploit this ILP as well.
Here is a well-explained question on how a 2-wide warp scheduler and pipelining work together: How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?
Also check out Vasily Volkov's presentation from GTC 2010. He experimentally showed how ILP improves CUDA code performance. http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
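The kind of ILP Volkov measured can be sketched as follows: a hypothetical SAXPY-style kernel in which each thread processes several independent elements, so the loads and multiply-adds for different elements can overlap in the in-order pipeline (the kernel name and the factor of 4 are illustrative, not from the presentation):

```cuda
#define ILP 4  // independent elements per thread (illustrative choice)

// Hypothetical kernel computing y[i] = a * x[i] + y[i].
// Each thread handles ILP independent elements, so the compiler can
// interleave their loads and FMAs -- the instructions are simply
// independent, and no reordering hardware is needed.
__global__ void saxpy_ilp(int n, float a, const float *x, float *y)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    if (i + ILP <= n) {
        float x0 = x[i],     y0 = y[i];      // the four pairs of loads
        float x1 = x[i + 1], y1 = y[i + 1];  // are independent, so
        float x2 = x[i + 2], y2 = y[i + 2];  // their latencies overlap
        float x3 = x[i + 3], y3 = y[i + 3];
        y[i]     = a * x0 + y0;  // four independent multiply-adds
        y[i + 1] = a * x1 + y1;
        y[i + 2] = a * x2 + y2;
        y[i + 3] = a * x3 + y3;
    }
}
```

With fewer threads each doing more independent work per thread, the pipeline stays busy even when occupancy is low, which is the effect Volkov measured.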
As for out-of-order execution on GPUs, I don't think so. Hardware instruction reordering, speculative execution, and all that kind of machinery are too expensive to implement per SM, as you note. And thread-level parallelism fills the gap left by the lack of out-of-order execution: when a true dependency is encountered, other warps can kick in and fill the pipeline.