Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How good is NVCC at code optimizations?

Tags:

cuda

gpgpu

How well does NVCC optimize device code? Does it do any sort of optimizations like constant folding and common subexpression elimination?

E.g, will it reduce the following:

float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);

to this:

float sqrt_2pi = sqrtf(2 * M_PI); // Compile time constant
float a = 1 / sqrt_2pi;
float b = c / sqrt_2pi;

What about more clever optimizations, involving knowing semantics of math functions:

float a = 1 / sqrtf(c * M_PI);
float b = c / sqrtf(M_PI);

to this:

float sqrt_pi = sqrtf(M_PI); // Compile time constant
float a = 1 / (sqrt_pi * sqrtf(c));
float b = c / sqrt_pi;
like image 747
Alex B Avatar asked Sep 23 '11 10:09

Alex B


People also ask

Does LTO improve performance?

LTO provides a performance boost for all the compilers. With small projects, this boost probably wouldn't be noticeable, but for big ones this option definitely makes a difference.

What is nvcc used for?

It is the purpose of nvcc , the CUDA compiler driver, to hide the intricate details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.

How does nvcc compiler work?

NVCC is a compiler driver which works by invoking all the necessary tools and compilers like cudacc, g++, cl, etc. NVCC can output either C code (CPU Code) that must then be compiled with the rest of the application using another tool or PTX or object code directly.

Does nvcc use GCC?

nvcc is the compiler driver used to compile both . cu and . cpp files. It uses the cl.exe (on Windows) or gcc (on Linux) executable that it can find as the compiler for host code.


1 Answers

The compiler is way ahead of you. In your example:

float a = 1 / sqrtf(2 * M_PI);
float b = c / sqrtf(2 * M_PI);

nvopencc (Open64) will emit this:

    mov.f32         %f2, 0f40206c99;        // 2.50663
    div.full.f32    %f3, %f1, %f2;
    mov.f32         %f4, 0f3ecc422a;        // 0.398942

which is equivalent to

float b = c / 2.50663f;
float a = 0.398942f;

The second case gets compiled to this:

float a = 1 / sqrtf(c * 3.14159f); // 0f40490fdb
float b = c / 1.77245f; // 0f3fe2dfc5

I am guessing the expression for a generated by the compiler should be more accurate than your "optmized" version, but about the same speed.

like image 84
talonmies Avatar answered Sep 18 '22 20:09

talonmies