I really need very fast round() function in C - it is necessary for Monte Carlo particle modeling: at every step you need to wrap coordinates into periodic box to compute volume interactions : for example
for(int i=0; i < 3; i++)
{
coor.x[i] = a.XReal.x[i]-b.XReal.x[i];
coor.x[i] = coor.x[i] - SIZE[i]*round(coor.x[i]/SIZE[i]); //PBC
}
I've come across some asm hacking with it, but i don't understand asm at all:) something like this
inline int float2int2(float flt)
{
int intgr;
__asm__ __volatile__ ("fld %1; fistp %0;" : "=m" (intgr) : "m" (flt));
return intgr;
}
With fixed boundaries, without round() it works faster. So, maybe someone knows a better way?..
First of all, you can get some gains by using the right compiler options. With GCC and a modern Intel CPU for example, you should try:
-march=nehalem -fno-trapping-math
Then the problem with round
is that it uses a specific rounding mode, which is slow on most platforms. nearbyint
(or rint
) should always be faster:
coor.x[i] = coor.x[i] - SIZE[i] * nearbyint(coor.x[i] / SIZE[i])
Have a look a the generated assembly.
You should also think about vectorizing your code.
Instead of looking for just fast rounding, ideally you want the whole process of range-reduction into the periodic box to be fast. As @EOF accurately pointed out in a comment, you could use a C99 standard function like remainderf()
or fmodf()
.
coor.x[i] -= SIZE[i]*round(coor.x[i]/SIZE[i]);
// same as
coor.x[i] = remainderf(coor.x[i], SIZE[i]);
fmodf(3)
rounds towards zero, remainderf(3)
rounds towards nearest.
The
remainder()
function computes the remainder of dividingx
byy
. The return value isx-n*y
, wheren
is the valuex / y
, rounded to the nearest integer. If the absolute value ofx-n*y
is 0.5, n is chosen to be even.
Compilers / libraries have several different strategies for implementing these. With -ffast-math
, gcc 5.3 for x86-64 inlines a remainder(x,y)
implementation that transfers the values from SSE registers to x87 registers, and runs FPREM1
(partial remainder) in a loop until it sets a flag indicating that the result is correct. (One execution of FPREM1
can reduce the exponent by at most 63).
clang always emits a call to the library function, either the normal remainder
entry point, or __remainder_finite
with -ffast-math
.
The GNU libm definition uses mostly integer operations, AFAICT from the disassembly and the C source. On a recent Intel CPU with fast hardware divide, it might be slower than your div, round, mul version.
div, round, mul, sub, with fast rounding (use nearbyint()
, it apparently has the least ugly semantics so it can inline to roundsd
/ roundss
most easily). This way can vectorize, and do all three coordinates at once. May need to do it manually, to find something that won't fault for the 4th element. On Intel Haswell with 128b vectors: 5 uops. single-precision: divps
(10-13c latency, one per 7c throughput), roundps
(2 uops, 6c latency, one per 2c throughput), mulps
(5c latency, one per 0.5c throughput), subps
(3c latency, one per 1c throughput). Some of these compete with each other for execution ports. Total latency: 27c. Probable throughput, maybe something like one per 7c (totally bottlenecked by divps)
gcc's inlined x87 FPREM1
. (probably only needs to run one iteration, so on Haswell: 41 uops, 27c latency, one per 17c throughput, plus some overhead for getting data between xmm and x87 regs. Can't vectorize.
glibc's mostly-integer implementation: no idea, probably worse than either of the other two, on modern x86 CPUs. But, probably significantly higher accuracy than the manual div/round/mul/sub.
Bottom line, if this is a speed issue, you should definitely look into vectorizing with SSE/AVX to do all three coordinates of a point in one vector. Or, a coordinate of four points at once, or whatever is convenient. Ideally you can make use of all 4 (or 8 with AVX) single-precision elements of the vector ALUs. (or 2 / 4 for double-precision).
Even scalar, I think your current code with nearbyint()
is going to be the fastest choice, but you can easily go three times faster than that with vectors.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With