I am aware of 3 methods, but as far as I know, only the first 2 are generally used: <ol> <li> Mask off the sign bit using <code>andps</code> or <code>andnotps</code>. <ul> <li>Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop.</li> <li>Cons: The mask may not be in a register or worse, not even in a cache, causing a very long memory fetch.</li> </ul> </li> <li> Subtract the value from zero to negate, and then get the max of the original and negated. <ul> <li>Pros: Fixed cost because nothing is needed to fetch, like a mask.</li> <li>Cons: Will always be slower than the mask method if conditions are ideal, and we have to wait for the <code>subps</code> to complete before using the <code>maxps</code> instruction.</li> </ul> </li> <li> Similar to option 2, subtract the original value from zero to negate, but then "bitwise and" the result with the original using <code>andps</code>. I ran a test comparing this to method 2, and it seems to behave identically to method 2, aside from when dealing with <code>NaN</code>s, in which case the result will be a different <code>NaN</code> than method 2's result. <ul> <li>Pros: Should be slightly faster than method 2 because <code>andps</code> is usually faster than <code>maxps</code>.</li> <li>Cons: Can this result in any unintended behavior when <code>NaN</code>s are involved? Maybe not, because a <code>NaN</code> is still a <code>NaN</code>, even if it's a different value of <code>NaN</code>, right?</li> </ul> </li> </ol> Thoughts and opinions are welcome.

TL;DR: In almost all cases, use pcmpeq/shift to generate a mask, and andps to use it. It has the shortest critical path by far (tied with constant-from-memory), and can't cache-miss. <h3>How to do that with intrinsics</h3> Getting the compiler to emit <code>pcmpeqd</code> on an uninitialized register can be tricky. (godbolt). The best way for gcc / icc looks to be <pre class="prettyprint"><code>__m128 abs_mask(void){ // with clang, this turns into a 16B load, // with every calling function getting its own copy of the mask __m128i minus1 = _mm_set1_epi32(-1); return _mm_castsi128_ps(_mm_srli_epi32(minus1, 1)); } // MSVC is BAD when inlining this into loops __m128 vecabs_and(__m128 v) { return _mm_and_ps(abs_mask(), v); } __m128 sumabs(const __m128 *a) { // quick and dirty no alignment checks __m128 sum = vecabs_and(*a); for (int i=1 ; i < 10000 ; i++) { // gcc, clang, and icc hoist the mask setup out of the loop after inlining // MSVC doesn't! sum = _mm_add_ps(sum, vecabs_and(a[i])); // one accumulator makes addps latency the bottleneck, not throughput } return sum; } </code></pre> clang 3.5 and later "optimizes" the set1 / shift into loading a constant from memory. It will use <code>pcmpeqd</code> to implement <code>set1_epi32(-1)</code>, though. TODO: find a sequence of intrinsics that produces the desired machine code with clang. Loading a constant from memory isn't a performance disaster, but having every function use a different copy of the mask is pretty terrible. MSVC: VS2013: <ul> <li><code>_mm_uninitialized_si128()</code> is not defined.</li> <li><code>_mm_cmpeq_epi32(self,self)</code> on an uninitialized variable will emit a <code>movdqa xmm, [ebp-10h]</code> in this test case (i.e. load some uninitialized data from the stack. This has less risk of a cache miss than just loading the final constant from memory. However, Kumputer says MSVC didn't manage to hoist the pcmpeqd / psrld out of the loop (I assume when inlining <code>vecabs</code>), so this is unusable unless you manually inline and hoist the constant out of a loop yourself.</li> <li>Using <code>_mm_srli_epi32(_mm_set1_epi32(-1), 1)</code> results in a movdqa to load a vector of all -1 (hoisted outside the loop), and a <code>psrld</code> inside the loop. So that's completely horrible. If you're going to load a 16B constant, it should be the final vector. Having integer instructions generating the mask every loop iteration is also horrible.</li> </ul> Suggestions for MSVC: Give up on generating the mask on the fly, and just write <pre class="prettyprint"><code>const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(~(1<<31)); </code></pre> Probably you'll just get the mask stored in memory as a 16B constant. Hopefully not duplicated for every function that uses it. Having the mask in a memory constant is more likely to be helpful in 32bit code, where you only have 8 XMM registers, so <code>vecabs</code> can just ANDPS with a memory source operand if it doesn't have a register free to keep a constant lying around. TODO: find out how to avoid duplicating the constant everywhere it's inlined. Probably using a global constant, rather than an anonymous <code>set1</code>, would be good. But then you need to initialize it, but I'm not sure intrinsics work as initializers for global <code>__m128</code> variables. You want it to go in the read-only data section, not to have a constructor that runs at program startup. <hr> Alternatively, use <pre class="prettyprint"><code>__m128i minus1; // undefined #if _MSC_VER && !__INTEL_COMPILER minus1 = _mm_setzero_si128(); // PXOR is cheaper than MSVC's silly load from the stack #endif minus1 = _mm_cmpeq_epi32(minus1, minus1); // or use some other variable here, which will probably cost a mov insn without AVX, unless the variable is dead. const __m128 absmask = _mm_castsi128_ps(_mm_srli_epi32(minus1, 1)); </code></pre> The extra PXOR is quite cheap, but it's still a uop and still 4 bytes on code size. If anyone has any better solution to overcome MSVC's reluctance to emit the code we want, leave a comment or edit. This is no good if inlined into a loop, though, because the pxor/pcmp/psrl will all be inside the loop. Loading a 32bit constant with <code>movd</code> and broadcasting with <code>shufps</code> might be ok (again, you probably have to manually hoist this out of a loop, though). That's 3 instructions (mov-immediate to a GP reg, movd, shufps), and movd is slow on AMD where the vector unit is shared between two integer cores. (Their version of hyperthreading.) <hr> <h3>Choosing the best asm sequence</h3> Ok, lets look at this for let's say Intel Sandybridge through Skylake, with a bit of mention of Nehalem. See Agner Fog's microarch guides and instruction timings for how I worked this out. I also used Skylake numbers someone linked in a post on the http://realwordtech.com/ forums. <hr> Lets say the vector we want to <code>abs()</code> is in <code>xmm0</code>, and is part of a long dependency chain as is typical for FP code. So lets assume any operations that don't depend on <code>xmm0</code> can begin several cycles before <code>xmm0</code> is ready. I've tested, and instructions with memory operands don't add extra latency to a dependency chain, assuming the address of the memory operand isn't part of the dep chain (i.e. isn't part of the critical path). <hr> I'm not totally clear on how early a memory operation can start when it's part of a micro-fused uop. As I understand it, the Re-Order Buffer (ROB) works with fused uops, and tracks uops from issue to retirement (168(SnB) to 224(SKL) entries). There's also a scheduler that works in the unfused domain, holding only uops that have their input operands ready but haven't yet executed. uops can issue into the ROB (fused) and scheduler (unfused) at the same time when they're decoded (or loaded from the uop cache). If I'm understanding this correctly, it's 54 to 64 entries in Sandybridge to Broadwell, and 97 in Skylake. There's some unfounded speculation about it not being a unified (ALU/load-store) scheduler anymore. There's also talk of Skylake handling 6 uops per clock. As I understand it, Skylake will read whole uop-cache lines (up to 6 uops) per clock into a buffer between the uop cache and the ROB. Issue into the ROB/scheduler is still 4-wide. (Even <code>nop</code> is still 4 per clock). This buffer helps where code alignment / uop cache line boundaries cause bottlenecks for previous Sandybridge-microarch designs. I previously thought this "issue queue" was this buffer, but apparently it isn't. However it works, the scheduler is large enough to get the data from cache ready in time, if the address isn't on the critical path. <hr> <h3>1a: mask with a memory operand</h3> <pre class="prettyprint"><code>ANDPS xmm0, [mask] # in the loop </code></pre> <ul> <li>bytes: 7 insn, 16 data. (AVX: 8 insn)</li> <li>fused-domain uops: 1 * n</li> <li>latency added to critical path: 1c (assuming L1 cache hit)</li> <li>throughput: 1/c. (Skylake: 2/c) (limited by 2 loads / c)</li> <li>"latency" if <code>xmm0</code> was ready when this insn issued: ~4c on an L1 cache hit.</li> </ul> <hr> <h3>1b: mask from a register</h3> <pre class="prettyprint"><code>movaps xmm5, [mask] # outside the loop ANDPS xmm0, xmm5 # in a loop # or PAND xmm0, xmm5 # higher latency, but more throughput on Nehalem to Broadwell # or with an inverted mask, if set1_epi32(0x80000000) is useful for something else in your loop: VANDNPS xmm0, xmm5, xmm0 # It's the dest that's NOTted, so non-AVX would need an extra movaps </code></pre> <ul> <li>bytes: 10 insn + 16 data. (AVX: 12 insn bytes)</li> <li>fused-domain uops: 1 + 1*n</li> <li>latency added to a dep chain: 1c (with the same cache-miss caveat for early in the loop)</li> <li>throughput: 1/c. (Skylake: 3/c) </li> </ul> <code>PAND</code> is throughput 3/c on Nehalem to Broadwell, but latency=3c (if used between two FP-domain operations, and even worse on Nehalem). I guess only port5 has the wiring to forward bitwise ops directly to the other FP execution units (pre Skylake). Pre-Nehalem, and on AMD, bitwise FP ops are treated identically to integer FP ops, so they can run on all ports, but have a forwarding delay. <hr> <h3>1c: generate the mask on the fly:</h3> <pre class="prettyprint"><code># outside a loop PCMPEQD xmm5, xmm5 # set to 0xff... Recognized as independent of the old value of xmm5, but still takes an execution port (p1/p5). PSRLD xmm5, 1 # 0x7fff... # port0 # or PSLLD xmm5, 31 # 0x8000... to set up for ANDNPS ANDPS xmm0, xmm5 # in the loop. # port5 </code></pre> <ul> <li>bytes: 12 (AVX: 13)</li> <li>fused-domain uops: 2 + 1*n (no memory ops)</li> <li>latency added to a dep chain: 1c</li> <li>throughput: 1/c. (Skylake: 3/c) </li> <li>throughput for all 3 uops: 1/c saturating all 3 vector ALU ports</li> <li>"latency" if <code>xmm0</code> was ready when this sequence issued (no loop): 3c (+1c possible bypass delay on SnB/IvB if ANDPS has to wait for integer data to be ready. Agner Fog says in some cases there's no extra delay for integer->FP-boolean on SnB/IvB.)</li> </ul> This version still takes less memory than versions with a 16B constant in memory. It's also ideal for an infrequently-called function, because there's no load to suffer a cache miss. The "bypass delay" shouldn't be an issue. If xmm0 is part of a long dependency chain, the mask-generating instructions will execute well ahead of time, so the integer result in xmm5 will have time to reach ANDPS before xmm0 is ready, even if it takes the slow lane. Haswell has no bypass delay for integer results -> FP boolean, according to Agner Fog's testing. His description for SnB/IvB says this is the case with the outputs of some integer instructions. So even in the "standing start" beginning-of-a-dep-chain case where <code>xmm0</code> is ready when this instruction sequence issues, it's only 3c on *well, 4c on *Bridge. Latency probably doesn't matter if the execution units are clearing the backlog of uops as fast as they're being issued. Either way, ANDPS's output will be in the FP domain, and have no bypass delay if used in <code>MULPS</code> or something. On Nehalem, bypass delays are 2c. So at the start of a dep chain (e.g. after a branch mispredict or I$ miss) on Nehalem, "latency" if <code>xmm0</code> was ready when this sequence issued is 5c. If you care a lot about Nehalem, and expect this code to be the first thing that runs after frequent branch mispredicts or similar pipeline stalls that leaves the OoOE machinery unable to get started on calculating the mask before <code>xmm0</code> is ready, then this might not be the best choice for non-loop situations. <hr> <h3>2a: AVX max(x, 0-x)</h3> <pre class="prettyprint"><code>VXORPS xmm5, xmm5, xmm5 # outside the loop VSUBPS xmm1, xmm5, xmm0 # inside the loop VMAXPS xmm0, xmm0, xmm1 </code></pre> <ul> <li>bytes: AVX: 12 </li> <li>fused-domain uops: 1 + 2*n (no memory ops)</li> <li>latency added to a dep chain: 6c (Skylake: 8c)</li> <li>throughput: 1 per 2c (two port1 uops). (Skylake: 1/c, assuming <code>MAXPS</code> uses the same two ports as <code>SUBPS</code>.)</li> </ul> Skylake drops the separate vector-FP add unit, and does vector adds in the FMA units on ports 0 and 1. This doubles FP add throughput, at the cost of 1c more latency. The FMA latency is down to 4 (from 5 in *well). x87 <code>FADD</code> is still 3 cycle latency, so there's still a 3-cycle scalar 80bit-FP adder, but only on one port. <h3>2b: same but without AVX:</h3> <pre class="prettyprint"><code># inside the loop XORPS xmm1, xmm1 # not on the critical path, and doesn't even take an execution unit on SnB and later SUBPS xmm1, xmm0 MAXPS xmm0, xmm1 </code></pre> <ul> <li>bytes: 9</li> <li>fused-domain uops: 3*n (no memory ops)</li> <li>latency added to a dep chain: 6c (Skylake: 8c)</li> <li>throughput: 1 per 2c (two port1 uops). (Skylake: 1/c)</li> <li>"latency" if <code>xmm0</code> was ready when this sequence issued (no loop): same</li> </ul> Zeroing a register with a zeroing-idiom that the processor recognizes (like <code>xorps same,same</code>) is handled during register rename on Sandbridge-family microarchitectures, and has zero latency, and throughput of 4/c. (Same as reg->reg moves that IvyBridge and later can eliminate.) It's not free, though: It still takes a uop in the fused domain, so if your code is only bottlenecked by the 4uop/cycle issue rate, this will slow you down. This is more likely with hyperthreading. <hr> <h3>3: ANDPS(x, 0-x)</h3> <pre class="prettyprint"><code>VXORPS xmm5, xmm5, xmm5 # outside the loop. Without AVX: zero xmm1 inside the loop VSUBPS xmm1, xmm5, xmm0 # inside the loop VANDPS xmm0, xmm0, xmm1 </code></pre> <ul> <li>bytes: AVX: 12 non-AVX: 9 </li> <li>fused-domain uops: 1 + 2*n (no memory ops). (Without AVX: 3*n)</li> <li>latency added to a dep chain: 4c (Skylake: 5c)</li> <li>throughput: 1/c (saturate p1 and p5). Skylake: 3/2c: (3 vector uops/cycle) / (uop_p01 + uop_p015).</li> <li>"latency" if <code>xmm0</code> was ready when this sequence issued (no loop): same</li> </ul> This should work, but IDK either what happens with NaN. Nice observation that ANDPS is lower latency and doesn't require the FPU add port. This is the smallest size with non-AVX. <hr> <h3>4: shift left/right:</h3> <pre class="prettyprint"><code>PSLLD xmm0, 1 PSRLD xmm0, 1 </code></pre> <ul> <li>bytes: 10 (AVX: 10)</li> <li>fused-domain uops: 2*n</li> <li>latency added to a dep chain: 4c (2c + bypass delays)</li> <li>throughput: 1/2c (saturate p0, also used by FP mul). (Skylake 1/c: doubled vector shift throughput)</li> <li> "latency" if <code>xmm0</code> was ready when this sequence issued (no loop): same This is the smallest (in bytes) with AVX. This has possibilities where you can't spare a register, and it isn't used in a loop. (In loop with no regs to spare, prob. use <code>andps xmm0, [mask]</code>). </li> </ul> I assume there's a 1c bypass delay from FP to integer-shift, and then another 1c on the way back, so this is as slow as SUBPS/ANDPS. It does save a no-execution-port uop, so it has advantages if fused-domain uop throughput is an issue, and you can't pull mask-generation out of a loop. (e.g. because this is in a function that's called in a loop, not inlined). <hr> When to use what: Loading the mask from memory makes the code simple, but has the risk of a cache miss. And takes up 16B of ro-data instead of 9 instruction bytes. <ul> <li>Needed in a loop: 1c: Generate the mask outside the loop (with pcmp/shift); use a single <code>andps</code> inside. If you can't spare the register, spill it to the stack and 1a: <code>andps xmm0, [rsp + mask_local]</code>. (Generating and storing is less likely to lead to a cache miss than a constant). Only adds 1 cycle to the critical path either way, with 1 single-uop instruction inside the loop. It's a port5 uop, so if your loop saturates the shuffle port and isn't latency-bound, <code>PAND</code> might be better. (SnB/IvB have shuffles units on p1/p5, but Haswell/Broadwell/Skylake can only shuffle on p5. Skylake did increase the throughput for <code>(V)(P)BLENDV</code>, but not other shuffle-port ops. If the AIDA numbers are right, non-AVX BLENDV is 1c lat ~3/c tput, but AVX BLENDV is 2c lat, 1/c tput (still a tput improvement over Haswell))</li> <li> Needed once in a frequently-called non-looping function (so you can't amortize mask generation over multiple uses): <ol> <li>If uop throughput is an issue: 1a: <code>andps xmm0, [mask]</code>. The occasional cache-miss should be amortized over the savings in uops, if that really was the bottleneck.</li> <li>If latency isn't an issue (the function is only used as part of short non-loop-carried dep chains, e.g. <code>arr[i] = abs(2.0 + arr[i]);</code>), and you want to avoid the constant in memory: 4, because it's only 2 uops. If <code>abs</code> comes at the start or end of a dep chain, there won't be a bypass delay from a load or to a store.</li> <li>If uop throughput isn't an issue: 1c: generate on the fly with integer <code>pcmpeq / shift</code>. No cache miss possible, and only adds 1c to the critical path.</li> </ol> </li> <li>Needed (outside any loops) in an infrequently-called function: Just optimize for size (neither small version uses a constant from memory). non-AVX: 3. AVX: 4. They're not bad, and can't cache-miss. 4 cycle latency is worse for the critical path than you'd get with version 1c, so if you don't think 3 instruction bytes is a big deal, pick 1c. Version 4 is interesting for register pressure situations when performance isn't important, and you'd like to avoid spilling anything.</li> </ul> <hr> <ul> <li>AMD CPUs: There's a bypass delay to/from <code>ANDPS</code> (which by itself has 2c latency), but I think it's still the best choice. It still beats the 5-6 cycle latency of <code>SUBPS</code>. <code>MAXPS</code> is 2c latency. With the high latencies of FP ops on Bulldozer-family CPUs, you're even more likely for out-of-order execution to be able to generate your mask on the fly in time for it to be ready when the other operand to <code>ANDPS</code> is. I'm guessing Bulldozer through Steamroller don't have a separate FP add unit, and instead do vector adds and multiplies in the FMA unit. 3 will always be a bad choice on AMD Bulldozer-family CPUs. 2 looks better in that case, because of a shorter bypass delay from the fma domain to the fp domain and back. See Agner Fog's microarch guide, pg 182 (15.11 Data delay between different execution domains).</li> <li>Silvermont: Similar latencies to SnB. Still go with 1c for loops, and prob. also for one-time use. Silvermont is out-of-order, so it can get the mask ready ahead of time to still only add 1 cycle to the critical path.</li> </ul>

Fastest way to compute absolute value using SSE

Tags:

x86

vectorization

simd

sse

absolute-value

I am aware of 3 methods, but as far as I know, only the first 2 are generally used:

Mask off the sign bit using andps or andnotps.
- Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop.
- Cons: The mask may not be in a register or worse, not even in a cache, causing a very long memory fetch.
Subtract the value from zero to negate, and then get the max of the original and negated.
- Pros: Fixed cost because nothing is needed to fetch, like a mask.
- Cons: Will always be slower than the mask method if conditions are ideal, and we have to wait for the subps to complete before using the maxps instruction.
Similar to option 2, subtract the original value from zero to negate, but then "bitwise and" the result with the original using andps. I ran a test comparing this to method 2, and it seems to behave identically to method 2, aside from when dealing with NaNs, in which case the result will be a different NaN than method 2's result.
- Pros: Should be slightly faster than method 2 because andps is usually faster than maxps.
- Cons: Can this result in any unintended behavior when NaNs are involved? Maybe not, because a NaN is still a NaN, even if it's a different value of NaN, right?

Thoughts and opinions are welcome.

758

asked Sep 05 '15 01:09

Kumputer

1 Answers

TL;DR: In almost all cases, use pcmpeq/shift to generate a mask, and andps to use it. It has the shortest critical path by far (tied with constant-from-memory), and can't cache-miss.

How to do that with intrinsics

Getting the compiler to emit pcmpeqd on an uninitialized register can be tricky. (godbolt). The best way for gcc / icc looks to be

__m128 abs_mask(void){
  // with clang, this turns into a 16B load,
  // with every calling function getting its own copy of the mask
  __m128i minus1 = _mm_set1_epi32(-1);
  return _mm_castsi128_ps(_mm_srli_epi32(minus1, 1));
}
// MSVC is BAD when inlining this into loops
__m128 vecabs_and(__m128 v) {
  return _mm_and_ps(abs_mask(), v);
}


__m128 sumabs(const __m128 *a) { // quick and dirty no alignment checks
  __m128 sum = vecabs_and(*a);
  for (int i=1 ; i < 10000 ; i++) {
      // gcc, clang, and icc hoist the mask setup out of the loop after inlining
      // MSVC doesn't!
      sum = _mm_add_ps(sum, vecabs_and(a[i])); // one accumulator makes addps latency the bottleneck, not throughput
  }
  return sum;
}

clang 3.5 and later "optimizes" the set1 / shift into loading a constant from memory. It will use pcmpeqd to implement set1_epi32(-1), though. TODO: find a sequence of intrinsics that produces the desired machine code with clang. Loading a constant from memory isn't a performance disaster, but having every function use a different copy of the mask is pretty terrible.

MSVC: VS2013:

_mm_uninitialized_si128() is not defined.
_mm_cmpeq_epi32(self,self) on an uninitialized variable will emit a movdqa xmm, [ebp-10h] in this test case (i.e. load some uninitialized data from the stack. This has less risk of a cache miss than just loading the final constant from memory. However, Kumputer says MSVC didn't manage to hoist the pcmpeqd / psrld out of the loop (I assume when inlining vecabs), so this is unusable unless you manually inline and hoist the constant out of a loop yourself.
Using _mm_srli_epi32(_mm_set1_epi32(-1), 1) results in a movdqa to load a vector of all -1 (hoisted outside the loop), and a psrld inside the loop. So that's completely horrible. If you're going to load a 16B constant, it should be the final vector. Having integer instructions generating the mask every loop iteration is also horrible.

Suggestions for MSVC: Give up on generating the mask on the fly, and just write

const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(~(1<<31));

Probably you'll just get the mask stored in memory as a 16B constant. Hopefully not duplicated for every function that uses it. Having the mask in a memory constant is more likely to be helpful in 32bit code, where you only have 8 XMM registers, so vecabs can just ANDPS with a memory source operand if it doesn't have a register free to keep a constant lying around.

TODO: find out how to avoid duplicating the constant everywhere it's inlined. Probably using a global constant, rather than an anonymous set1, would be good. But then you need to initialize it, but I'm not sure intrinsics work as initializers for global __m128 variables. You want it to go in the read-only data section, not to have a constructor that runs at program startup.

Alternatively, use

__m128i minus1;  // undefined
#if _MSC_VER && !__INTEL_COMPILER
minus1 = _mm_setzero_si128();  // PXOR is cheaper than MSVC's silly load from the stack
#endif
minus1 = _mm_cmpeq_epi32(minus1, minus1);  // or use some other variable here, which will probably cost a mov insn without AVX, unless the variable is dead.
const __m128 absmask = _mm_castsi128_ps(_mm_srli_epi32(minus1, 1));

The extra PXOR is quite cheap, but it's still a uop and still 4 bytes on code size. If anyone has any better solution to overcome MSVC's reluctance to emit the code we want, leave a comment or edit. This is no good if inlined into a loop, though, because the pxor/pcmp/psrl will all be inside the loop.

Loading a 32bit constant with movd and broadcasting with shufps might be ok (again, you probably have to manually hoist this out of a loop, though). That's 3 instructions (mov-immediate to a GP reg, movd, shufps), and movd is slow on AMD where the vector unit is shared between two integer cores. (Their version of hyperthreading.)

Choosing the best asm sequence

Ok, lets look at this for let's say Intel Sandybridge through Skylake, with a bit of mention of Nehalem. See Agner Fog's microarch guides and instruction timings for how I worked this out. I also used Skylake numbers someone linked in a post on the http://realwordtech.com/ forums.

Lets say the vector we want to abs() is in xmm0, and is part of a long dependency chain as is typical for FP code.

So lets assume any operations that don't depend on xmm0 can begin several cycles before xmm0 is ready. I've tested, and instructions with memory operands don't add extra latency to a dependency chain, assuming the address of the memory operand isn't part of the dep chain (i.e. isn't part of the critical path).

I'm not totally clear on how early a memory operation can start when it's part of a micro-fused uop. As I understand it, the Re-Order Buffer (ROB) works with fused uops, and tracks uops from issue to retirement (168(SnB) to 224(SKL) entries). There's also a scheduler that works in the unfused domain, holding only uops that have their input operands ready but haven't yet executed. uops can issue into the ROB (fused) and scheduler (unfused) at the same time when they're decoded (or loaded from the uop cache). If I'm understanding this correctly, it's 54 to 64 entries in Sandybridge to Broadwell, and 97 in Skylake. There's some unfounded speculation about it not being a unified (ALU/load-store) scheduler anymore.

There's also talk of Skylake handling 6 uops per clock. As I understand it, Skylake will read whole uop-cache lines (up to 6 uops) per clock into a buffer between the uop cache and the ROB. Issue into the ROB/scheduler is still 4-wide. (Even nop is still 4 per clock). This buffer helps where code alignment / uop cache line boundaries cause bottlenecks for previous Sandybridge-microarch designs. I previously thought this "issue queue" was this buffer, but apparently it isn't.

However it works, the scheduler is large enough to get the data from cache ready in time, if the address isn't on the critical path.

1a: mask with a memory operand

ANDPS  xmm0, [mask]  # in the loop

bytes: 7 insn, 16 data. (AVX: 8 insn)
fused-domain uops: 1 * n
latency added to critical path: 1c (assuming L1 cache hit)
throughput: 1/c. (Skylake: 2/c) (limited by 2 loads / c)
"latency" if xmm0 was ready when this insn issued: ~4c on an L1 cache hit.

1b: mask from a register

movaps   xmm5, [mask]   # outside the loop

ANDPS    xmm0, xmm5     # in a loop
# or PAND   xmm0, xmm5    # higher latency, but more throughput on Nehalem to Broadwell

# or with an inverted mask, if set1_epi32(0x80000000) is useful for something else in your loop:
VANDNPS   xmm0, xmm5, xmm0   # It's the dest that's NOTted, so non-AVX would need an extra movaps

bytes: 10 insn + 16 data. (AVX: 12 insn bytes)
fused-domain uops: 1 + 1*n
latency added to a dep chain: 1c (with the same cache-miss caveat for early in the loop)
throughput: 1/c. (Skylake: 3/c)

PAND is throughput 3/c on Nehalem to Broadwell, but latency=3c (if used between two FP-domain operations, and even worse on Nehalem). I guess only port5 has the wiring to forward bitwise ops directly to the other FP execution units (pre Skylake). Pre-Nehalem, and on AMD, bitwise FP ops are treated identically to integer FP ops, so they can run on all ports, but have a forwarding delay.

1c: generate the mask on the fly:

# outside a loop
PCMPEQD  xmm5, xmm5  # set to 0xff...  Recognized as independent of the old value of xmm5, but still takes an execution port (p1/p5).
PSRLD    xmm5, 1     # 0x7fff...  # port0
# or PSLLD xmm5, 31  # 0x8000...  to set up for ANDNPS

ANDPS    xmm0, xmm5  # in the loop.  # port5

bytes: 12 (AVX: 13)
fused-domain uops: 2 + 1*n (no memory ops)
latency added to a dep chain: 1c
throughput: 1/c. (Skylake: 3/c)
throughput for all 3 uops: 1/c saturating all 3 vector ALU ports
"latency" if xmm0 was ready when this sequence issued (no loop): 3c (+1c possible bypass delay on SnB/IvB if ANDPS has to wait for integer data to be ready. Agner Fog says in some cases there's no extra delay for integer->FP-boolean on SnB/IvB.)

This version still takes less memory than versions with a 16B constant in memory. It's also ideal for an infrequently-called function, because there's no load to suffer a cache miss.

The "bypass delay" shouldn't be an issue. If xmm0 is part of a long dependency chain, the mask-generating instructions will execute well ahead of time, so the integer result in xmm5 will have time to reach ANDPS before xmm0 is ready, even if it takes the slow lane.

Haswell has no bypass delay for integer results -> FP boolean, according to Agner Fog's testing. His description for SnB/IvB says this is the case with the outputs of some integer instructions. So even in the "standing start" beginning-of-a-dep-chain case where xmm0 is ready when this instruction sequence issues, it's only 3c on *well, 4c on *Bridge. Latency probably doesn't matter if the execution units are clearing the backlog of uops as fast as they're being issued.

Either way, ANDPS's output will be in the FP domain, and have no bypass delay if used in MULPS or something.

On Nehalem, bypass delays are 2c. So at the start of a dep chain (e.g. after a branch mispredict or I$ miss) on Nehalem, "latency" if xmm0 was ready when this sequence issued is 5c. If you care a lot about Nehalem, and expect this code to be the first thing that runs after frequent branch mispredicts or similar pipeline stalls that leaves the OoOE machinery unable to get started on calculating the mask before xmm0 is ready, then this might not be the best choice for non-loop situations.

2a: AVX max(x, 0-x)

VXORPS  xmm5, xmm5, xmm5   # outside the loop

VSUBPS  xmm1, xmm5, xmm0   # inside the loop
VMAXPS  xmm0, xmm0, xmm1

bytes: AVX: 12
fused-domain uops: 1 + 2*n (no memory ops)
latency added to a dep chain: 6c (Skylake: 8c)
throughput: 1 per 2c (two port1 uops). (Skylake: 1/c, assuming MAXPS uses the same two ports as SUBPS.)

Skylake drops the separate vector-FP add unit, and does vector adds in the FMA units on ports 0 and 1. This doubles FP add throughput, at the cost of 1c more latency. The FMA latency is down to 4 (from 5 in *well). x87 FADD is still 3 cycle latency, so there's still a 3-cycle scalar 80bit-FP adder, but only on one port.

2b: same but without AVX:

# inside the loop
XORPS  xmm1, xmm1   # not on the critical path, and doesn't even take an execution unit on SnB and later
SUBPS  xmm1, xmm0
MAXPS  xmm0, xmm1

bytes: 9
fused-domain uops: 3*n (no memory ops)
latency added to a dep chain: 6c (Skylake: 8c)
throughput: 1 per 2c (two port1 uops). (Skylake: 1/c)
"latency" if xmm0 was ready when this sequence issued (no loop): same

Zeroing a register with a zeroing-idiom that the processor recognizes (like xorps same,same) is handled during register rename on Sandbridge-family microarchitectures, and has zero latency, and throughput of 4/c. (Same as reg->reg moves that IvyBridge and later can eliminate.)

It's not free, though: It still takes a uop in the fused domain, so if your code is only bottlenecked by the 4uop/cycle issue rate, this will slow you down. This is more likely with hyperthreading.

3: ANDPS(x, 0-x)

VXORPS  xmm5, xmm5, xmm5   # outside the loop.  Without AVX: zero xmm1 inside the loop

VSUBPS  xmm1, xmm5, xmm0   # inside the loop
VANDPS  xmm0, xmm0, xmm1

bytes: AVX: 12 non-AVX: 9
fused-domain uops: 1 + 2*n (no memory ops). (Without AVX: 3*n)
latency added to a dep chain: 4c (Skylake: 5c)
throughput: 1/c (saturate p1 and p5). Skylake: 3/2c: (3 vector uops/cycle) / (uop_p01 + uop_p015).
"latency" if xmm0 was ready when this sequence issued (no loop): same

This should work, but IDK either what happens with NaN. Nice observation that ANDPS is lower latency and doesn't require the FPU add port.

This is the smallest size with non-AVX.

4: shift left/right:

PSLLD  xmm0, 1
PSRLD  xmm0, 1

bytes: 10 (AVX: 10)
fused-domain uops: 2*n
latency added to a dep chain: 4c (2c + bypass delays)
throughput: 1/2c (saturate p0, also used by FP mul). (Skylake 1/c: doubled vector shift throughput)
"latency" if xmm0 was ready when this sequence issued (no loop): same

This is the smallest (in bytes) with AVX.

This has possibilities where you can't spare a register, and it isn't used in a loop. (In loop with no regs to spare, prob. use andps xmm0, [mask]).

I assume there's a 1c bypass delay from FP to integer-shift, and then another 1c on the way back, so this is as slow as SUBPS/ANDPS. It does save a no-execution-port uop, so it has advantages if fused-domain uop throughput is an issue, and you can't pull mask-generation out of a loop. (e.g. because this is in a function that's called in a loop, not inlined).

When to use what: Loading the mask from memory makes the code simple, but has the risk of a cache miss. And takes up 16B of ro-data instead of 9 instruction bytes.

Needed in a loop: 1c: Generate the mask outside the loop (with pcmp/shift); use a single andps inside. If you can't spare the register, spill it to the stack and 1a: andps xmm0, [rsp + mask_local]. (Generating and storing is less likely to lead to a cache miss than a constant). Only adds 1 cycle to the critical path either way, with 1 single-uop instruction inside the loop. It's a port5 uop, so if your loop saturates the shuffle port and isn't latency-bound, PAND might be better. (SnB/IvB have shuffles units on p1/p5, but Haswell/Broadwell/Skylake can only shuffle on p5. Skylake did increase the throughput for (V)(P)BLENDV, but not other shuffle-port ops. If the AIDA numbers are right, non-AVX BLENDV is 1c lat ~3/c tput, but AVX BLENDV is 2c lat, 1/c tput (still a tput improvement over Haswell))
Needed once in a frequently-called non-looping function (so you can't amortize mask generation over multiple uses):
1. If uop throughput is an issue: 1a: andps xmm0, [mask]. The occasional cache-miss should be amortized over the savings in uops, if that really was the bottleneck.
2. If latency isn't an issue (the function is only used as part of short non-loop-carried dep chains, e.g. arr[i] = abs(2.0 + arr[i]);), and you want to avoid the constant in memory: 4, because it's only 2 uops. If abs comes at the start or end of a dep chain, there won't be a bypass delay from a load or to a store.
3. If uop throughput isn't an issue: 1c: generate on the fly with integer pcmpeq / shift. No cache miss possible, and only adds 1c to the critical path.
Needed (outside any loops) in an infrequently-called function: Just optimize for size (neither small version uses a constant from memory). non-AVX: 3. AVX: 4. They're not bad, and can't cache-miss. 4 cycle latency is worse for the critical path than you'd get with version 1c, so if you don't think 3 instruction bytes is a big deal, pick 1c. Version 4 is interesting for register pressure situations when performance isn't important, and you'd like to avoid spilling anything.

AMD CPUs: There's a bypass delay to/from ANDPS (which by itself has 2c latency), but I think it's still the best choice. It still beats the 5-6 cycle latency of SUBPS. MAXPS is 2c latency. With the high latencies of FP ops on Bulldozer-family CPUs, you're even more likely for out-of-order execution to be able to generate your mask on the fly in time for it to be ready when the other operand to ANDPS is. I'm guessing Bulldozer through Steamroller don't have a separate FP add unit, and instead do vector adds and multiplies in the FMA unit. 3 will always be a bad choice on AMD Bulldozer-family CPUs. 2 looks better in that case, because of a shorter bypass delay from the fma domain to the fp domain and back. See Agner Fog's microarch guide, pg 182 (15.11 Data delay between different execution domains).
Silvermont: Similar latencies to SnB. Still go with 1c for loops, and prob. also for one-time use. Silvermont is out-of-order, so it can get the mask ready ahead of time to still only add 1 cycle to the critical path.

answered Sep 21 '22 02:09

Peter Cordes

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fastest way to compute absolute value using SSE

Tags:

x86

vectorization

simd

sse

absolute-value

Kumputer

People also ask

1 Answers

How to do that with intrinsics

Choosing the best asm sequence

1a: mask with a memory operand

1b: mask from a register

1c: generate the mask on the fly:

2a: AVX max(x, 0-x)

2b: same but without AVX:

3: ANDPS(x, 0-x)

4: shift left/right:

Peter Cordes

Recent Activity

Donate For Us

Fastest way to compute absolute value using SSE

Tags:

x86

vectorization

simd

sse

absolute-value

Kumputer

People also ask

1 Answers

How to do that with intrinsics

Choosing the best asm sequence

1a: mask with a memory operand

1b: mask from a register

1c: generate the mask on the fly:

2a: AVX max(x, 0-x)

2b: same but without AVX:

3: ANDPS(x, 0-x)

4: shift left/right:

Peter Cordes

Related questions

Recent Activity

Donate For Us