Why is floor() so slow?

A couple of things make floor slower than a cast and prevent vectorization.

The most important one:

floor can modify global state. If you pass a value that is too large to be represented as an integer in floating-point format, the errno variable gets set to EDOM. NaNs get special handling as well. All of this behavior exists for applications that want to detect the overflow case and handle it somehow (don't ask me how).

Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. It's also a lot of code, so inlining the whole floor function would make your program run slower.

Some compilers have flags that let them optimize away some of the rarely used C standard rules. For example, GCC can be told that you're not interested in errno at all: pass -fno-math-errno or -ffast-math. ICC and VC may have similar flags.

Btw, you can roll your own floor function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.

If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):

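static inline int int_floor(double x) /* static: plain inline in C may fail to link */
{
  int i = (int)x; /* truncate toward zero */
  return i - ( i > x ); /* subtract 1 when truncation rounded up (negative non-integers) */
}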
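For comparison, here is a sketch of what the fast version skips. The name int_floor_checked and its return convention are made up for illustration; it assumes a platform where every int value is exactly representable as a double (true for 32-bit int).

```c
#include <limits.h>

/* Hypothetical checked variant: returns 1 and stores floor(x) in *out when
 * the result fits in an int, returns 0 for out-of-range values and NaN. */
static int int_floor_checked(double x, int *out)
{
    /* Written so that NaN is rejected: every comparison with NaN is false. */
    if (!(x >= (double)INT_MIN && x < (double)INT_MAX + 1.0))
        return 0;

    int i = (int)x;       /* truncate toward zero, now known to be in range */
    *out = i - (i > x);   /* convert trunc to floor */
    return 1;
}
```

The extra range test is the kind of checking the cast-based version trades away for speed.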

Branchless floor and ceiling (makes better use of the pipeline), with no error checking:

int f(double x)
{
    return (int) x - (x < (int) x); // as dgobbi above, needs less than for floor
}

int c(double x)
{
    return (int) x + (x > (int) x);
}

Or define the ceiling in terms of the floor function f above:

int c(double x)
{
    return -(f(-x));
}

The actual fastest implementation for a large array on modern x86 CPUs would be:

1. Change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
2. Loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integers using the current rounding mode, and storing the result vectors to the destination. cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2 (https://agner.org/optimize/). Same for the 256-bit AVX version, with YMM vectors.
3. Restore the rounding mode to the normal IEEE default mode (round-to-nearest, with even as a tiebreak), using the original value of the MXCSR.
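The steps above could look roughly like this with SSE2 intrinsics. This is a sketch, not a tuned implementation: floor_to_int_sse2 is an illustrative name, the loop is not unrolled, and it assumes an x86 target where the compiler does not move the conversions across the MXCSR writes (compilers do not always model that dependency).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* (SSE) */
#include <emmintrin.h>  /* _mm_cvtps_epi32, _mm_storeu_si128 (SSE2) */

/* Converts n floats to ints, rounding toward -Infinity.
 * Inputs are assumed to fit in a signed 32-bit integer. */
void floor_to_int_sse2(const float *src, int *dst, size_t n)
{
    /* Step 1: save MXCSR and switch the rounding mode to round-down. */
    unsigned int saved = _mm_getcsr();
    _mm_setcsr((saved & ~(unsigned int)_MM_ROUND_MASK) | _MM_ROUND_DOWN);

    /* Step 2: cvtps2dq converts 4 floats using the current rounding mode. */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_cvtps_epi32(v));
    }
    /* Scalar tail: cvtss2si also honors the current rounding mode. */
    for (; i < n; ++i)
        dst[i] = _mm_cvtss_si32(_mm_load_ss(src + i));

    /* Step 3: restore the caller's MXCSR, including its rounding mode. */
    _mm_setcsr(saved);
}
```

Saving and restoring the whole MXCSR value, rather than hard-coding round-to-nearest, keeps the function safe to call from code that already runs in a non-default mode.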

This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers: cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic), or for scalar, going from XMM to integer registers, cvttss2si, or cvttsd2si for scalar double to scalar integer. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back.)

With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.

Without changing the current rounding mode, you need SSE4.1 roundps to round a float to an integer-valued float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
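The truncate-and-correct trick from the scalar answers can also be written with plain SSE2, leaving the rounding mode alone. A minimal sketch, with floor4_sse2 as an illustrative name, valid only for inputs whose floor fits in a signed 32-bit integer:

```c
#include <emmintrin.h>  /* SSE2 */

/* Floors 4 floats to 4 ints without touching MXCSR.
 * Out-of-range and NaN lanes are not handled. */
static __m128i floor4_sse2(__m128 x)
{
    __m128i t    = _mm_cvttps_epi32(x);   /* truncate toward zero */
    __m128  back = _mm_cvtepi32_ps(t);    /* truncated values as floats */
    /* Lanes where x < trunc(x) compare true, giving all-ones, i.e. -1. */
    __m128i mask = _mm_castps_si128(_mm_cmplt_ps(x, back));
    return _mm_add_epi32(t, mask);        /* adding -1 turns trunc into floor */
}
```

This costs a few more instructions per vector than the rounding-mode approach, but needs no setup or restore around the loop.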

(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. SSE4.1 is unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)

