I have a piece of code that runs under a heavily contended lock, so it needs to be as fast as possible. The code is very simple - a basic multiply-add over a bunch of data - and looks like this:
for (int i = 0; i < size; i++)
{
    c[i] += (double)a[i] * (double)b[i];
}
Under -O3 with SSE enabled, the code is vectorized as I would expect. However, with AVX code generation turned on I get a 10-15% slowdown instead of a speedup, and I can't figure out why.
Here's the benchmark code:
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main()
{
    int size = 1 << 20;
    float *a = new float[size];
    float *b = new float[size];
    double *c = new double[size];
    for (int i = 0; i < size; i++)
    {
        a[i] = rand();
        b[i] = rand();
        c[i] = rand();
    }
    for (int j = 0; j < 10; j++)
    {
        auto begin = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < size; i++)
        {
            c[i] += (double)a[i] * (double)b[i];
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
        printf("%lldus\n", (long long)duration);
    }
}
Here's the generated assembly under SSE:
0x100007340 <+144>: cvtps2pd (%r13,%rbx,4), %xmm0
0x100007346 <+150>: cvtps2pd 0x8(%r13,%rbx,4), %xmm1
0x10000734c <+156>: cvtps2pd (%r15,%rbx,4), %xmm2
0x100007351 <+161>: mulpd %xmm0, %xmm2
0x100007355 <+165>: cvtps2pd 0x8(%r15,%rbx,4), %xmm0
0x10000735b <+171>: mulpd %xmm1, %xmm0
0x10000735f <+175>: movupd (%r14,%rbx,8), %xmm1
0x100007365 <+181>: addpd %xmm2, %xmm1
0x100007369 <+185>: movupd 0x10(%r14,%rbx,8), %xmm2
0x100007370 <+192>: addpd %xmm0, %xmm2
0x100007374 <+196>: movupd %xmm1, (%r14,%rbx,8)
0x10000737a <+202>: movupd %xmm2, 0x10(%r14,%rbx,8)
0x100007381 <+209>: addq $0x4, %rbx
0x100007385 <+213>: cmpq $0x100000, %rbx ; imm = 0x100000
0x10000738c <+220>: jne 0x100007340 ; <+144> at main.cpp:26:20
Results from running the SSE benchmark:
1411us
1246us
1243us
1267us
1242us
1237us
1246us
1242us
1250us
1229us
Generated assembly with AVX enabled:
0x1000070b0 <+144>: vcvtps2pd (%r13,%rbx,4), %ymm0
0x1000070b7 <+151>: vcvtps2pd 0x10(%r13,%rbx,4), %ymm1
0x1000070be <+158>: vcvtps2pd 0x20(%r13,%rbx,4), %ymm2
0x1000070c5 <+165>: vcvtps2pd 0x30(%r13,%rbx,4), %ymm3
0x1000070cc <+172>: vcvtps2pd (%r15,%rbx,4), %ymm4
0x1000070d2 <+178>: vmulpd %ymm4, %ymm0, %ymm0
0x1000070d6 <+182>: vcvtps2pd 0x10(%r15,%rbx,4), %ymm4
0x1000070dd <+189>: vmulpd %ymm4, %ymm1, %ymm1
0x1000070e1 <+193>: vcvtps2pd 0x20(%r15,%rbx,4), %ymm4
0x1000070e8 <+200>: vcvtps2pd 0x30(%r15,%rbx,4), %ymm5
0x1000070ef <+207>: vmulpd %ymm4, %ymm2, %ymm2
0x1000070f3 <+211>: vmulpd %ymm5, %ymm3, %ymm3
0x1000070f7 <+215>: vaddpd (%r14,%rbx,8), %ymm0, %ymm0
0x1000070fd <+221>: vaddpd 0x20(%r14,%rbx,8), %ymm1, %ymm1
0x100007104 <+228>: vaddpd 0x40(%r14,%rbx,8), %ymm2, %ymm2
0x10000710b <+235>: vaddpd 0x60(%r14,%rbx,8), %ymm3, %ymm3
0x100007112 <+242>: vmovupd %ymm0, (%r14,%rbx,8)
0x100007118 <+248>: vmovupd %ymm1, 0x20(%r14,%rbx,8)
0x10000711f <+255>: vmovupd %ymm2, 0x40(%r14,%rbx,8)
0x100007126 <+262>: vmovupd %ymm3, 0x60(%r14,%rbx,8)
0x10000712d <+269>: addq $0x10, %rbx
0x100007131 <+273>: cmpq $0x100000, %rbx ; imm = 0x100000
0x100007138 <+280>: jne 0x1000070b0 ; <+144> at main.cpp:26:20
Results from running the AVX benchmark:
1532us
1404us
1480us
1464us
1410us
1383us
1333us
1362us
1494us
1526us
Note that the AVX code being generated with twice as many instructions as the SSE code doesn't really matter - I've tried a smaller unroll by hand (to match the SSE version) and AVX was still slower.
For context, I'm using macOS 11 and Xcode 12 on a Mac Pro 6,1 (trashcan) with an Intel Xeon E5-1650 v2 @ 3.50GHz.
Update: alignment didn't help much, if at all. There may also be another bottleneck, e.g. in the packed float->double conversion? Also, vcvtps2pd (%r13,%rbx,4), %ymm0 only has a 16-byte memory source, so only the stores are 32 bytes wide; there are no 32-byte split loads here. (I wrote the answer below before looking carefully enough at the code.)
That's an Ivy Bridge CPU. Is your data aligned by 32? If not, it's a known fact that cache-line splits on 32-byte loads or stores are a serious bottleneck for those old microarchitectures. Those early Intel AVX-supporting CPUs have full-width ALUs, but they run a 32-byte load or store as 2 separate data cycles in the execution units from the same uop (footnote 1), making a cache-line split an extra special (and extra slow) case (https://www.realworldtech.com/sandy-bridge/7/). This is unlike Haswell (and Zen 2) and later, which have 32-byte data paths (footnote 2).
So slow that GCC's default -mtune=generic code-generation will even split 256-bit AVX loads and stores that aren't known at compile time to be aligned. (This is way overkill and hurts especially on newer CPUs, and/or when the data actually is aligned but the compiler doesn't know it, or when the data is aligned in the common case but the function still has to work on the occasional misaligned array; letting the hardware deal with that rare case is cheaper than running extra instructions in the common case just to check for it.)
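If the data really is 32-byte aligned, you can tell the compiler so and avoid that split code-gen entirely. A minimal sketch using the GCC/clang builtin __builtin_assume_aligned (a promise, not a check: passing a misaligned pointer is undefined behaviour):

// Compile with e.g. -O3 -mavx. The builtin returns void*, hence the casts.
void madd(const float *a, const float *b, double *c, int size)
{
    const float *pa = static_cast<const float *>(__builtin_assume_aligned(a, 32));
    const float *pb = static_cast<const float *>(__builtin_assume_aligned(b, 32));
    double      *pc = static_cast<double *>(__builtin_assume_aligned(c, 32));

    for (int i = 0; i < size; i++)
        pc[i] += (double)pa[i] * (double)pb[i];
}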
But you're using clang, which makes somewhat nice code here (unrolled by 4x) that would perform well with aligned arrays, or on a newer CPU like Haswell. Unfortunately it uses indexed addressing modes, defeating much of the purpose of unrolling (especially for Intel Sandy Bridge / Ivy Bridge), because the load and the ALU uop un-laminate and go through the front-end separately; see Micro fusion and addressing modes. (Haswell can keep some of them micro-fused in the SSE case but not AVX, e.g. the stores.)
You can use aligned_alloc, or maybe do something with C++17 aligned new to get an aligned allocation that's compatible with delete.
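A rough sketch of both options (assuming C++17, and that your libc provides aligned_alloc):

#include <cstdlib>   // std::aligned_alloc, std::free
#include <new>       // std::align_val_t, aligned ::operator new[] / delete[]

int main()
{
    const std::size_t size = 1 << 20;

    // Option 1: C++17 over-aligned operator new[]. Raw 32-byte-aligned storage;
    // must be released with the matching aligned operator delete[].
    float *a = static_cast<float *>(
        ::operator new[](size * sizeof(float), std::align_val_t{32}));
    ::operator delete[](a, std::align_val_t{32});

    // Option 2: aligned_alloc (C11 / C++17). The total size must be a multiple
    // of the alignment (it is here), and the pointer is released with free().
    double *c = static_cast<double *>(std::aligned_alloc(32, size * sizeof(double)));
    std::free(c);
}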
Plain new may be giving you a pointer aligned by 16 but misaligned by 32. I don't know about macOS, but on Linux glibc's allocator for large-ish allocations typically keeps 16 bytes for bookkeeping at the start of a page, so you typically get large allocations that are 16 bytes away from alignment by anything larger than 16.
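You can check what you're actually getting on your machine with something like this quick sketch:

#include <cstdint>
#include <cstdio>

int main()
{
    int size = 1 << 20;
    float *a = new float[size];
    float *b = new float[size];
    double *c = new double[size];

    // Address modulo 32 is 0 only if the pointer is 32-byte aligned.
    printf("a %% 32 = %u\n", unsigned(reinterpret_cast<std::uintptr_t>(a) % 32));
    printf("b %% 32 = %u\n", unsigned(reinterpret_cast<std::uintptr_t>(b) % 32));
    printf("c %% 32 = %u\n", unsigned(reinterpret_cast<std::uintptr_t>(c) % 32));

    delete[] a;
    delete[] b;
    delete[] c;
}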
Footnote 2: That single uop that spends a 2nd cycle in a load port still only does address-generation once. That allows another uop (e.g. a store-address uop) to use the AGU while the 2nd data cycle is happening, so it doesn't interfere with the address-handling part being fully pipelined.
SnB / IvB have only 2 AGU/load ports, so they can normally execute up to 2 memory operations per clock, at most one of which is a store. But with 32-byte loads and stores only needing an address every 2 data cycles (and store-data already being a separate uop for a different port than store-address), that lets SnB / IvB sustain 2 loads + 1 store per clock for the special case of 32-byte loads and stores. (That uses most of the front-end bandwidth, so those loads typically need to be micro-fused as a memory source operand of another instruction.)
See also my answer on How can cache be that fast? on electronics.SE.
Footnote 1: Zen 1 (and Bulldozer-family) decode all 32-byte operations into 2 separate uops, so there's no special case. One half of the load can be split across the cache line, and that would be exactly like a 16-byte load that came from an xmm load.
If I assume that the 16MiB of data accessed effectively flushes the 12MiB LLC, then essentially all traffic is to/from DRAM. Counting the reads of all three arrays and the write-back of c, the fastest SSE time corresponds to 20.48 GB/s of DRAM bandwidth, while the fastest AVX time corresponds to 18.88 GB/s. Both are similar to the ~19.5 GB/s best-case single-thread bandwidths (with the same read:write ratios) that I measured on a Xeon E5-2680 v2.
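To spell out the arithmetic behind those figures, here is a sketch that just redoes the math from the fastest timings above:

#include <cstdio>

int main()
{
    const double MiB = 1024.0 * 1024.0;
    // Per pass over 1<<20 elements: read a (4 MiB) + read b (4 MiB)
    // + read c (8 MiB) + write back c (8 MiB) = 24 MiB of DRAM traffic.
    const double bytes = 24.0 * MiB;

    const double sse_seconds = 1229e-6;  // fastest SSE time above
    const double avx_seconds = 1333e-6;  // fastest AVX time above

    printf("SSE: %.2f GB/s\n", bytes / sse_seconds / 1e9);  // ~20.48 GB/s
    printf("AVX: %.2f GB/s\n", bytes / avx_seconds / 1e9);  // ~18.88 GB/s
}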