Is there a faster way to multiply by 2 on SIMD (without using multiplication)?

An old trick with floats used to be never to multiply by 2 but to add the operand to itself, since 2*a = a + a. Is that trick still worthwhile with SSE/SSE2/SSSE3/NEON/... instruction sets and the like today? My operand would be a vector (say, 4 floats that I want to multiply by 2). What about multiplying by 3, 4, ...?

asked Dec 25 '22 by user1095108

2 Answers

Compiler writers are clever. For floating-point numbers x, 2.0*x and x + x are absolutely identical. Therefore, a compiler is quite capable of replacing 2.0*x with x + x and vice versa, depending on which is faster.

And that can be complicated. Addition is usually faster. But consider a processor that can do, say, one multiplication and one addition per cycle. Then you would want to replace 2*x and 2*y with 2*x and y+y. If you have the operations 2*x and y+z, then you don't want to replace 2*x with x+x, because then you have two additions, which take two cycles. Then there are processors with fused multiply-add, which can calculate a*b + c in one operation; on those you wouldn't want to change 2*x + y to (x + x) + y.

Best to leave it to the compiler.
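As a quick way to see this for yourself, a minimal C function like the one below (the name is mine), compiled at -O2 and inspected with gcc -S or a compiler explorer, is typically turned into an addition (addsd/vaddsd) by both GCC and Clang, since the two forms give bit-identical results:

/* twice.c -- check what the compiler does with 2.0*x */
double twice(double x)
{
    return 2.0 * x;  /* typically compiled to x + x, e.g. addsd xmm0, xmm0 */
}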

answered Feb 13 '23 by gnasher729


I'm still trying to find an example where this would make a difference. My gut feeling is that if latency is an issue, there are cases where x+x would be better; but if latency is not an issue and only throughput matters, it could be worse. But first let's discuss some hardware.

Let me stick to Intel x86 processors since that's what I know best. Let's consider the following generations of hardware: Core2/Nehalem, SandyBridge/IvyBridge, and Haswell/Broadwell.

Latency and throughput for SIMD floating-point arithmetic operations (an intrinsics sketch follows the list):

  • The latency of addition is 3 cycles.
  • Except on Broadwell, the latency of multiplication is 5 cycles.
  • On Broadwell, multiplication has a latency of 3 cycles.
  • The throughput of addition is 1 per cycle.
  • Except on Haswell and Broadwell, the throughput of multiplication is 1 per cycle.
  • On Haswell and Broadwell, the throughput of multiplication is 2 per cycle.
  • The combined throughput of addition and multiplication without FMA is 2 per cycle.
  • The latency of FMA is 5 cycles.
  • The throughput of FMA is 2 per cycle, which is equivalent to an addition-and-multiplication throughput of 4.
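Here is that sketch: the two ways of doubling a vector of 4 floats from the question, written with raw SSE intrinsics (function names are mine). The choice between them is exactly the latency/throughput trade-off listed above:

#include <xmmintrin.h>  /* SSE */

/* one mulps: latency 5 before Broadwell, 3 on Broadwell */
__m128 times2_mul(__m128 v)
{
    return _mm_mul_ps(v, _mm_set1_ps(2.0f));
}

/* one addps: latency 3 on all of these processors */
__m128 times2_add(__m128 v)
{
    return _mm_add_ps(v, v);
}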

Here is a case I'm actually using, for generating the Mandelbrot set, which has a factor of 2. In the main loop, two of the most critical lines of code are:

x = x*x - y*y + x0;
y = 2*xtemp*y + y0;

All of the variables here are SIMD (SSE or AVX) registers, so I'm acting on multiple pixels at once (4 with SSE, 8 with AVX, in single precision). I'm using a SIMD class wrapped around the intrinsics for this. For y I could instead do

y = xtemp*y + xtemp*y + y0;

What about with FMA?

y = fma(2*xtemp, y, y0);

or

y = xtemp*y + fma(xtemp, y, y0);

There are many variations that could be tried. I have not tried y = xtemp*y + xtemp*y + y0, but I think it would be worse. Incidentally, FMA, at least the way I have implemented it so far on my Haswell system, turns out not to help much: my frame rate only increases by about 15% with FMA, whereas it nearly doubles when I go from 4 pixels with SSE to 8 pixels with AVX.
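For concreteness, here is a sketch of those variants written directly with AVX/FMA intrinsics, assuming the SIMD wrapper maps straight onto them (function names are mine, not the actual class; compile with -mavx -mfma):

#include <immintrin.h>  /* AVX and FMA */

/* y = 2*xtemp*y + y0 without FMA: two multiplies and an add */
__m256 update_y_plain(__m256 xtemp, __m256 y, __m256 y0)
{
    __m256 t = _mm256_mul_ps(_mm256_set1_ps(2.0f), xtemp);
    return _mm256_add_ps(_mm256_mul_ps(t, y), y0);
}

/* y = fma(2*xtemp, y, y0): one multiply plus one fused multiply-add */
__m256 update_y_fma1(__m256 xtemp, __m256 y, __m256 y0)
{
    __m256 t = _mm256_mul_ps(_mm256_set1_ps(2.0f), xtemp);
    return _mm256_fmadd_ps(t, y, y0);
}

/* y = xtemp*y + fma(xtemp, y, y0): one multiply, one FMA, one add */
__m256 update_y_fma2(__m256 xtemp, __m256 y, __m256 y0)
{
    __m256 t = _mm256_mul_ps(xtemp, y);
    return _mm256_add_ps(t, _mm256_fmadd_ps(xtemp, y, y0));
}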

Edit: here are some cases which I thought would make a difference, but either they don't in practice or they would not make sense to do.

Consider this case

for(int i=0; i<n; i++) y[i] = 2*x[i];

In this case latency does not matter; throughput matters. On Haswell and Broadwell the throughput of multiplication is twice that of addition, so it might seem worse to do x+x, but since Haswell/Broadwell can only write 32 bytes per clock cycle, it makes no difference.
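Written out with SSE intrinsics, the loop body is a single arithmetic instruction either way (a sketch assuming x and y are 16-byte aligned and n is a multiple of 4):

#include <xmmintrin.h>

void double_array(float *y, const float *x, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&x[i]);
        /* addps instead of mulps; the stores, not the arithmetic,
           are the bottleneck, so the two run at the same speed */
        _mm_store_ps(&y[i], _mm_add_ps(v, v));
    }
}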

Here is a case where using x+x would seem to be better.

for(int i=0; i<n; i++) prod = prod * (2*x[i]);

Instead you could do this:

for(int i=0; i<n; i++) prod = prod * (x[i]+x[i]);

In both of these cases it makes no difference, since they are dominated by the latency of the multiplication of prod. However, if you unrolled the loop enough times that latency did not matter (see the sketch below), then the second case would be better in general, since all of these processors can do at least one addition and one multiplication per clock cycle. Although Haswell and Broadwell can do two multiplications per clock cycle, they can also do two multiply-adds per clock cycle using FMA, so even on them the addition version would be better.
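Here is a sketch of that unrolled version, using two independent accumulators so the multiplies of prod no longer form one long latency chain (scalar for clarity; the same idea applies to the SIMD version):

float unrolled_prod(const float *x, int n)  /* n assumed even */
{
    float p0 = 1.0f, p1 = 1.0f;
    for (int i = 0; i < n; i += 2) {
        /* two independent dependency chains overlap the multiply latency;
           the x+x goes to the addition port, leaving the multiplier free */
        p0 *= x[i] + x[i];
        p1 *= x[i+1] + x[i+1];
    }
    return p0 * p1;
}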

However, in this case the smart thing to do is

for(int i=0; i<n; i++) prod *= x[i];
prod *= pow(2,n);

So it would not be necessary to do x+x instead of 2*x.
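One refinement worth noting (my addition, not part of the answer's code): instead of pow(2,n), ldexpf from <math.h> scales by 2^n exactly, by adjusting the float's exponent directly:

#include <math.h>

float scaled_prod(const float *x, int n)
{
    float prod = 1.0f;
    for (int i = 0; i < n; i++)
        prod *= x[i];
    return ldexpf(prod, n);  /* prod * 2^n, exact barring overflow/underflow */
}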

answered Feb 13 '23 by Z boson