Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient integer floor function in C++

I want to define an efficient integer floor function, i.e. a conversion from float or double that performs truncation towards minus infinity.

We can assume that the values are such that no integer overflow occurs. So far I have a few options

  • casting to int; this requires special handling of the negative values, as the cast truncates toward zero;

    I= int(F); if (I < 0 && I != F) I--; 
  • casting the result of floor to int;

    int(floor(F)); 
  • casting to int with a large shift to get positives (this can return wrong results for large values);

    int(F + double(0x7fffffff)) - 0x7fffffff; 

Casting to int is notoriously slow. So are if tests. I have not timed the floor function, but seen posts claiming it is also slow.

Can you think of better alternatives in terms of speed, accuracy or allowed range ? It doesn't need to be portable. The targets are recent x86/x64 architectures.

like image 916
Yves Daoust Avatar asked Jun 02 '19 17:06

Yves Daoust


People also ask

What does floor () do in C?

The floor() function in C is used to convert a floating point number to its immediately smaller integer (for eg, 3.6 to 3). This function is a part of <math. h> header file.

Is the greatest integer function the floor function?

The floor function of x, historically called the greatest integer function, is the greatest integer less than or equal to x. This function is sometimes written [x], but is best written (a notation that was suggested by Iverson in 1962) to differentiate it from the ceiling function.

Is int function floor?

The "Int" function (short for "integer") is like the "Floor" function, BUT some calculators and computer programs show different results when given negative numbers: Some say int(−3.65) = −4 (the same as the Floor function)

Is floor Division available in C?

the answer is yes.


2 Answers

Casting to int is notoriously slow.

Maybe you've been living under a rock since x86-64, or otherwise missed that this hasn't been true for a while on x86. :)

SSE/SSE2 have an instruction to convert with truncation (instead of the default rounding mode). The ISA supports this operation efficiently precisely because conversion with C semantics is not rare in actual codebases. x86-64 code uses SSE/SSE2 XMM registers for scalar FP math, not x87, because of this and other things that make it more efficient. Even modern 32-bit code uses XMM registers for scalar math.

When compiling for x87 (without SSE3 fisttp), compilers used to have to change the x87 rounding mode to truncation, FP store to memory, then change the rounding mode back again. (And then reload the integer from memory, typically from a local on the stack, if doing further stuff with it.) x87 was terrible for this.

Yes that was horribly slow, e.g. in 2006 when the link in @Kirjain's answer was written, if you still had a 32-bit CPU or were using an x86-64 CPU to run 32-bit code.


Converting with a rounding mode other than truncation or default (nearest) isn't directly supported, and until SSE4.1 roundps/roundpd your best bet was magic-number tricks like in the 2006 link from @Kirjain's answer.

Some nice tricks there, but only for double -> 32-bit integer. Unlikely to be worth expanding to double if you have float.

Or more usually, simply add a large-magnitude number to trigger rounding, then subtract it again to get back to the original range. This can work for float without expanding to double, but I'm not sure how easy it is to make floor work.


Anyway, the obvious solution here is _mm256_floor_ps() and _mm256_cvtps_epi32 (vroundps and vcvtps2dq). A non-AVX version of this can work with SSE4.1.

I'm not sure if we can do even better; If you had a huge array to process (and couldn't manage to interleave this work with other work), you could set the MXCSR rounding mode to "towards -Inf" (floor) and simply use vcvtps2dq (which uses the current rounding mode). Then set it back. But it's probably better to cache-block your conversion or do it on the fly as you generate the data, presumably from other FP calculations that need the FP rounding mode set to the default Nearest.

roundps/pd/ss/sd is 2 uops on Intel CPUs, but only 1 uop (per 128-bit lane) on AMD Ryzen. cvtps2dq is also 1 uop. packed double->int conversion also includes a shuffle. Scalar FP->int conversion (that copies to an integer register) usually also costs an extra uop for that.

So there's room for the possibility of magic-number tricks being a win in some cases; it maybe worth investigating if _mm256_floor_ps() + cvt are part of a critical bottleneck (or more likely if you have double and want int32).


@Cássio Renan's int foo = floorf(f) will actually auto-vectorize if compiled with gcc -O3 -fno-trapping-math (or -ffast-math), with -march= something that has SSE4.1 or AVX. https://godbolt.org/z/ae_KPv

That's maybe useful if you're using this with other scalar code that's not manually vectorized. Especially if you're hoping the compiler will auto-vectorize the whole thing.

like image 79
Peter Cordes Avatar answered Oct 03 '22 02:10

Peter Cordes


Have a look at magic numbers. The algorithm proposed on the web page should be far more efficient than simple casting. I've never used it myself, but this is the performance comparison they offer on the site (xs_ToInt and xs_CRoundToInt are the proposed functions):

Performing 10000000 times: simple cast           2819 ms i.e. i = (long)f; xs_ToInt              1242 ms i.e. i = xs_ToInt(f); //numerically same as above bit-twiddle(full)     1093 ms i.e. i = BitConvertToInt(f); //rounding from Fluid fistp                  676 ms i.e. i = FISTToInt(f); //Herf, et al x86 Assembly rounding  bit-twiddle(limited)   623 ms i.e. i = FloatTo23Bits(f); //Herf, rounding only in the range (0...1]   xs_CRoundToInt         609 ms i.e. i = xs_CRoundToInt(f); //rounding with "magic" numbers 

Further, the xs_ToInt is apparently modified so that the performance improves:

Performing 10000000 times: simple cast convert   3186 ms i.e. fi = (f*65536); fistp convert         3031 ms i.e. fi = FISTToInt(f*65536); xs_ToFix               622 ms i.e. fi = xs_Fix<16>::ToFix(f); 

Brief explanation of how the 'magic numbers' method works:

"Basically, in order to add two floating point numbers, your processor "lines up" the decimal points of the numbers so that it can easily add the bits. It does this by "normalizing" the numbers such that the most significant bits are preserved, i.e. the smaller number "normalizes" to match the bigger one. So the principle of the "magic number" conversion that xs_CRoundToInt() uses is this: We add a big enough floating point number (a number that is so big that there are significant digits only UP TO the decimal point, and none after it) to the one you're converting such that: (a) the number gets normalized by the processor to its integer equivalent and (b) adding the two does not erase the integral significat bits in the number you were trying to convert (i.e. XX00 + 00YY = XXYY)."

The quote is taken from the same web page.

like image 35
borizzzzz Avatar answered Oct 03 '22 01:10

borizzzzz