Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Super-fast rounding function (PBC)

I really need very fast round() function in C - it is necessary for Monte Carlo particle modeling: at every step you need to wrap coordinates into periodic box to compute volume interactions : for example

for(int i=0; i < 3; i++)
{
    coor.x[i] = a.XReal.x[i]-b.XReal.x[i];
    coor.x[i] = coor.x[i] - SIZE[i]*round(coor.x[i]/SIZE[i]); //PBC
}

I've come across some asm hacking with it, but i don't understand asm at all:) something like this

inline int float2int2(float flt)
{
  int intgr;

  __asm__ __volatile__ ("fld %1; fistp %0;" : "=m" (intgr) : "m" (flt));

  return intgr;
}

With fixed boundaries, without round() it works faster. So, maybe someone knows a better way?..

like image 973
Rsmmnt Avatar asked Apr 08 '16 11:04

Rsmmnt


2 Answers

First of all, you can get some gains by using the right compiler options. With GCC and a modern Intel CPU for example, you should try:

-march=nehalem -fno-trapping-math

Then the problem with round is that it uses a specific rounding mode, which is slow on most platforms. nearbyint (or rint) should always be faster:

coor.x[i] = coor.x[i] - SIZE[i] * nearbyint(coor.x[i] / SIZE[i])

Have a look a the generated assembly.

You should also think about vectorizing your code.

like image 67
nwellnhof Avatar answered Oct 02 '22 15:10

nwellnhof


Instead of looking for just fast rounding, ideally you want the whole process of range-reduction into the periodic box to be fast. As @EOF accurately pointed out in a comment, you could use a C99 standard function like remainderf() or fmodf().

coor.x[i] -= SIZE[i]*round(coor.x[i]/SIZE[i]);
// same as
coor.x[i] = remainderf(coor.x[i], SIZE[i]);

fmodf(3) rounds towards zero, remainderf(3) rounds towards nearest.

The remainder() function computes the remainder of dividing x by y. The return value is x-n*y, where n is the value x / y, rounded to the nearest integer. If the absolute value of x-n*y is 0.5, n is chosen to be even.

Compilers / libraries have several different strategies for implementing these. With -ffast-math, gcc 5.3 for x86-64 inlines a remainder(x,y) implementation that transfers the values from SSE registers to x87 registers, and runs FPREM1 (partial remainder) in a loop until it sets a flag indicating that the result is correct. (One execution of FPREM1 can reduce the exponent by at most 63).

clang always emits a call to the library function, either the normal remainder entry point, or __remainder_finite with -ffast-math.

The GNU libm definition uses mostly integer operations, AFAICT from the disassembly and the C source. On a recent Intel CPU with fast hardware divide, it might be slower than your div, round, mul version.


So you have three options:

  • div, round, mul, sub, with fast rounding (use nearbyint(), it apparently has the least ugly semantics so it can inline to roundsd / roundss most easily). This way can vectorize, and do all three coordinates at once. May need to do it manually, to find something that won't fault for the 4th element. On Intel Haswell with 128b vectors: 5 uops. single-precision: divps(10-13c latency, one per 7c throughput), roundps(2 uops, 6c latency, one per 2c throughput), mulps(5c latency, one per 0.5c throughput), subps(3c latency, one per 1c throughput). Some of these compete with each other for execution ports. Total latency: 27c. Probable throughput, maybe something like one per 7c (totally bottlenecked by divps)

  • gcc's inlined x87 FPREM1. (probably only needs to run one iteration, so on Haswell: 41 uops, 27c latency, one per 17c throughput, plus some overhead for getting data between xmm and x87 regs. Can't vectorize.

  • glibc's mostly-integer implementation: no idea, probably worse than either of the other two, on modern x86 CPUs. But, probably significantly higher accuracy than the manual div/round/mul/sub.


Bottom line, if this is a speed issue, you should definitely look into vectorizing with SSE/AVX to do all three coordinates of a point in one vector. Or, a coordinate of four points at once, or whatever is convenient. Ideally you can make use of all 4 (or 8 with AVX) single-precision elements of the vector ALUs. (or 2 / 4 for double-precision).

Even scalar, I think your current code with nearbyint() is going to be the fastest choice, but you can easily go three times faster than that with vectors.

like image 34
Peter Cordes Avatar answered Oct 02 '22 17:10

Peter Cordes