Floating-point multiplication performs slower depending on operands in C

I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the value of the boundaries (1000 usually), whilst the rest of the elements are 0 for Zero type and 1 for NonZero type.

The code stores the matrix from the file in two allocated matrices of the same size. It then performs an operation on every element of one matrix using its own value and the values of its neighbours (4 additions and 1 multiplication), and stores the result in the second matrix. Once the computation is finished, the matrix pointers are swapped and the same operation is performed again, for a fixed number of time steps. Here is the core code:

#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time=0; cur_time<timeSteps; cur_time++) {
    for (i=1; i<rows-1; i++) {
        for (j=1; j<cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Change pointers for next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}

The case I am describing uses a fixed 500 timeSteps (outer iterations) and a matrix of 8192 rows by 8192 columns, but the problem persists when I change the number of timeSteps or the matrix size. Note that I only time this specific part of the algorithm, so neither reading the matrix from the file nor anything else affects the measurement.

What happens is that I get different times depending on which type of matrix I use, with much worse performance for the Zero type. Every other matrix performs the same as the NonZero type; I have already tried a matrix full of random values.

I am certain the multiplication is responsible: if I remove it and leave only the additions, both types perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".

This behaviour seems weird to me, as I thought that the speed of floating-point operations was independent of the operand values, which does not appear to be the case here. I have also tried to catch and report SIGFPE in case that was the problem, but nothing was raised.

In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.

asked Mar 03 '11 11:03

Nitros

1 Answer

The problem has already mostly been diagnosed, but I will write up exactly what happens here.

Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).

The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.

When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.

Note that changing the data type to double would delay, but not eliminate, the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.
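The two thresholds quoted above (step 55 for float, step 441 for double) can be checked with a short sketch. The function names below are hypothetical, and the code assumes IEEE 754 arithmetic evaluated in the declared type with flush-to-zero left at its default (off). Like the answer, it tracks 0.2^t only and ignores the boundary value of 1000.

```c
#include <math.h>

/* Returns the first t for which 0.2^t, built up by repeated
   single-precision multiplication, is a subnormal (denormal) float.
   Returns -1 if zero is reached first (not expected with IEEE 754). */
int first_subnormal_step_float(void)
{
    float v = 1.0f;
    for (int t = 1; v != 0.0f; t++) {
        v *= 0.2f;
        if (fpclassify(v) == FP_SUBNORMAL)
            return t;
    }
    return -1;
}

/* Same search in double precision, where the smallest normal
   value is 2^-1022 instead of 2^-126. */
int first_subnormal_step_double(void)
{
    double v = 1.0;
    for (int t = 1; v != 0.0; t++) {
        v *= 0.2;
        if (fpclassify(v) == FP_SUBNORMAL)
            return t;
    }
    return -1;
}
```

Called from a main function, the two helpers should return 55 and 441 respectively, since 0.2^54 is still above 2^-126 while 0.2^55 is below it, and likewise 0.2^440 and 0.2^441 straddle 2^-1022.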

At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header in the C library.
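Standard C's <fenv.h> has no portable flush-to-zero switch, so what it offers for this is platform-specific. As one possible sketch for gcc or clang on x86, the SSE intrinsic macro _MM_SET_FLUSH_ZERO_MODE from <xmmintrin.h> sets the relevant MXCSR bit. The two helper names below are hypothetical, and the use of the intrinsics rather than <fenv.h> is an assumption about the toolchain, not the route the answer recommends.

```c
#include <float.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
#endif

/* Turns on flush-to-zero for SSE arithmetic in the calling thread.
   Returns 1 if the mode was set, 0 if this build has no SSE support. */
int enable_flush_to_zero(void)
{
#if defined(__SSE__)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return 1;
#else
    return 0;
#endif
}

/* Halves the smallest normal float at run time. The exact result,
   2^-127, is denormal, so it comes back as 0 once flush-to-zero is on. */
float half_of_smallest_normal(void)
{
    volatile float tiny = FLT_MIN;  /* volatile blocks compile-time folding */
    volatile float half = 0.5f;
    return tiny * half;
}
```

In the questioner's program, enable_flush_to_zero() would be called once before the time-step loop. MXCSR is per-thread state, so each worker thread would need its own call. A related bit, "denormals are zero", makes the processor treat denormal inputs as zero and can be set with _MM_SET_DENORMALS_ZERO_MODE from <pmmintrin.h>.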

Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
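To see the effect of the initial fill value, one could run the question's stencil on a small grid and count the denormal cells afterwards. This is a minimal sketch: the function name count_subnormals_after and the AT macro are hypothetical, the matrices are filled in memory rather than read from a file, and the boundary value of 1000 is taken from the question.

```c
#include <float.h>
#include <math.h>
#include <stdlib.h>

#define AT(M, I, J) (M)[(size_t)(I) * (size_t)n + (size_t)(J)]

/* Runs `steps` iterations of the 5-point stencil on an n x n grid whose
   boundary is 1000.0f and whose interior starts at `fill`. Returns the
   number of subnormal cells afterwards, or -1 on bad size or allocation
   failure. */
long count_subnormals_after(int n, int steps, float fill)
{
    if (n < 3)
        return -1;

    size_t cells = (size_t)n * (size_t)n;
    float *rMat = malloc(cells * sizeof *rMat);
    float *wMat = malloc(cells * sizeof *wMat);
    long count = 0;

    if (rMat == NULL || wMat == NULL) {
        free(rMat);
        free(wMat);
        return -1;
    }

    /* Both buffers get the same boundary, since only the interior is rewritten. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int edge = (i == 0 || j == 0 || i == n - 1 || j == n - 1);
            AT(rMat, i, j) = AT(wMat, i, j) = edge ? 1000.0f : fill;
        }
    }

    for (int t = 0; t < steps; t++) {
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                AT(wMat, i, j) = 0.2f * (AT(rMat, i - 1, j) + AT(rMat, i, j - 1)
                                       + AT(rMat, i, j) + AT(rMat, i, j + 1)
                                       + AT(rMat, i + 1, j));
            }
        }
        /* Swap read and write buffers for the next iteration. */
        float *aux = wMat;
        wMat = rMat;
        rMat = aux;
    }

    /* After the final swap, rMat holds the most recent result. */
    for (size_t k = 0; k < cells; k++) {
        if (fpclassify(rMat[k]) == FP_SUBNORMAL)
            count++;
    }

    free(rMat);
    free(wMat);
    return count;
}
```

With an illustrative n of 128 and 60 steps, a fill of 0.0f should leave some denormal cells along the diffusion front, while a fill of FLT_MIN (the same value as 0x1.0p-126f) should leave none: 0.2f times the sum of five values that are each at least FLT_MIN rounds to at least FLT_MIN.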

answered Oct 28 '22 04:10

Stephen Canon

Tags: performance, c, floating-point, x86, intel, stencils
