I was benchmarking some essential routines by running loops such as:
float *src, *dst;
for (int i=0; i<cnt; i++) dst[i] = round(src[i]);
All with an AVX2 target and the newest Clang. Interestingly, floor(x), ceil(x), int(x)... all seem fast. But round(x) seems extremely slow, and looking at the disassembly there's some weird spaghetti code instead of the newer SSE or AVX instructions. Even when I prevent the loop from being vectorized by introducing a dependency, round is about 10x slower. For floor etc. the generated code uses vroundss; for round there's the spaghetti code... Any ideas?
Edit: I'm using -ffast-math, -mfpmath=sse, -fno-math-errno, -O3, -std=c++17, -march=core-avx2 -mavx2 -mfma
The problem is that none of the SSE rounding modes matches the rounding that round requires. As the round(3) man page puts it:
These functions round x to the nearest integer, but round halfway cases away from zero (regardless of the current rounding direction, see fenv(3)), instead of to the nearest even integer like rint(3).
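To make the difference concrete, here is a small sketch that prints both results for a few halfway cases. The helper names `compare_rounding` and `demo_ties` are made up for illustration, and the sketch assumes the floating-point environment is in its default round-to-nearest mode (it asserts this before running).

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstdio>

// Hypothetical helper: print how round and rint treat the same input.
void compare_rounding(float x) {
    std::printf("x=% .1f  round=% .1f  rint=% .1f\n",
                x, std::round(x), std::rint(x));
}

// Hypothetical driver: a few halfway cases plus one ordinary value.
void demo_ties() {
    // rint follows the current rounding mode, so this comparison
    // only makes sense in the default round-to-nearest mode.
    assert(std::fegetround() == FE_TONEAREST);

    const float samples[] = {0.5f, 1.5f, 2.5f, -0.5f, -2.5f, 2.4f};
    for (float x : samples) {
        compare_rounding(x);
    }
}
```

Calling `demo_ties()` should show the two functions agreeing everywhere except on exact halves, where round moves away from zero and rint picks the even neighbour.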
If you want faster code, you could try testing rint instead of round, as that specifies a rounding mode that SSE does support.
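As a rough sketch of that suggestion, the benchmark loop could be rewritten like this. The function name `round_array_rint` is hypothetical, and whether the compiler actually emits vroundss/vroundps for it depends on the flags and compiler version, so it is worth checking the disassembly again.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical replacement for the loop in the question: same shape,
// but using rint, which rounds according to the current rounding mode
// (ties to even by default) instead of ties away from zero.
void round_array_rint(const float* src, float* dst, std::size_t cnt) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < cnt; ++i) {
        dst[i] = std::rint(src[i]);
    }
}
```

Note that this is not a drop-in replacement if the exact halfway behaviour matters: inputs such as 0.5 or 2.5 will give different results than round.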