How good is NVCC at code optimizations?

Tags: cuda, gpgpu
How well does NVCC optimize device code? Does it do any sort of optimizations like constant folding and common subexpression elimination?

For example, will it reduce the following:

float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);

to this:

float sqrt_2pi = sqrtf(2 * M_PI); // Compile time constant
float a = 1 / sqrt_2pi;
float b = c / sqrt_2pi;

What about more clever optimizations that rely on knowing the semantics of math functions, e.g. going from this:

float a = 1 / sqrtf(c * M_PI);
float b = c / sqrtf(M_PI);

to this:

float sqrt_pi = sqrtf(M_PI); // Compile time constant
float a = 1 / (sqrt_pi * sqrtf(c));
float b = c / sqrt_pi;


asked Sep 23 '11 10:09

Alex B

1 Answer

The compiler is way ahead of you. In your example:

float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);

nvopencc (Open64) will emit this PTX:

    mov.f32         %f2, 0f40206c99;        // 2.50663
    div.full.f32    %f3, %f1, %f2;
    mov.f32         %f4, 0f3ecc422a;        // 0.398942

which is equivalent to

float b = c / 2.50663f;
float a = 0.398942f;
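
The `0f...` operands in the PTX are single-precision constants written as raw IEEE 754 bit patterns in hexadecimal. To check them against the decimal values in the comments, a small host-side sketch like the following can decode them. The helper name `ptx_literal_to_float` is made up for illustration and is not part of any CUDA API.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the 32 bits that follow a PTX "0f" prefix
// as an IEEE 754 single-precision float. memcpy avoids aliasing problems.
float ptx_literal_to_float(std::uint32_t bits) {
    float value;
    static_assert(sizeof(value) == sizeof(bits), "float must be 32 bits wide");
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

For instance, `ptx_literal_to_float(0x40206c99u)` should come out at roughly 2.50663 (that is, sqrtf(2 * M_PI)), and `ptx_literal_to_float(0x3ecc422au)` at roughly 0.398942 (its reciprocal), so both values of the first example were folded at compile time.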

The second case gets compiled to the equivalent of this (the PTX literals are shown in the comments):

float a = 1 / sqrtf(c * 3.14159f); // 0f40490fdb
float b = c / 1.77245f; // 0f3fe2dfc5

My guess is that the expression for a generated by the compiler is more accurate than your "optimized" version, and about the same speed.
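
The accuracy guess can be checked rather than assumed. Below is a minimal host-side sketch that computes the relative error of both forms of a against a double-precision reference. All function names here are hypothetical, and host `sqrt` and division are not guaranteed to round the same way as the device versions, so treat the numbers as indicative only.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;

// Double-precision reference value for 1 / sqrt(c * pi).
double reference(float c) {
    return 1.0 / std::sqrt(static_cast<double>(c) * kPi);
}

// Relative error of a single-precision result against the reference.
double relative_error(float approx, double exact) {
    return std::fabs((static_cast<double>(approx) - exact) / exact);
}

// Form the compiler kept: one multiply, one square root, one divide.
float inv_sqrt_combined(float c) {
    const float pi = static_cast<float>(kPi);
    return 1.0f / std::sqrt(c * pi);
}

// Hand-rewritten form: sqrtf(pi) is a constant, but sqrtf(c) and an extra
// multiply remain, each adding its own rounding step.
float inv_sqrt_split(float c) {
    const float sqrt_pi = std::sqrt(static_cast<float>(kPi));
    return 1.0f / (sqrt_pi * std::sqrt(c));
}
```

Comparing `relative_error(inv_sqrt_combined(c), reference(c))` with `relative_error(inv_sqrt_split(c), reference(c))` over a range of c values shows how the two forms differ in practice. The split form still needs a run-time square root, so it does not save the expensive operation.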

answered Sep 18 '22 20:09

talonmies
