What do you do without fast gather and scatter in AVX2 instructions?

I'm writing a program to detect prime numbers. One part is bit-sieving possible candidates out. I've written a fairly fast program, but I thought I'd see if anyone has some better ideas. My program could use some fast gather and scatter instructions, but I'm limited to AVX2 hardware for an x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are).

#include <stdint.h>
#include <immintrin.h>
#include <utility>   // std::swap, used by the scalar fallback

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);    // 3x primes
    __m256i sum2 = _mm256_set_epi64x(201, 213, 219, 237);   // These aren't always the same

    // The actual algorithm can change these
    __m256i mod1 = _mm256_set1_epi64x(1);
    __m256i mod3 = _mm256_set1_epi64x(1);

    __m256i mod2, mod4, sum3;

    // Sieve until all factors (start under 32-bit threshold) exceed the limit
    do {
        // Sieve until one of the factors exceeds the limit
        do {
            // Compiler does a nice job converting these into extracts
            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod1, _mm256_and_si256(mask, ans));

            ans = _mm256_add_epi64(ans, sum);

            // Early on these locations can overlap
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod2 = _mm256_sub_epi64(total, ans);

            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans2), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod3, _mm256_and_si256(mask, ans2));

            ans2 = _mm256_add_epi64(ans2, sum2);

            // Two types of candidates are being performed at once
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod4 = _mm256_sub_epi64(total, ans2);
        } while (!_mm256_movemask_pd(_mm256_castsi256_pd(_mm256_or_si256(mod2, mod4))));

        // Remove one factor
        mod2 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum), _mm256_castsi256_pd(mod2)));
        mod4 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum2), _mm256_castsi256_pd(mod4)));
        ans = _mm256_sub_epi64(ans, mod2);
        ans2 = _mm256_sub_epi64(ans2, mod4);
        sum = _mm256_sub_epi64(sum, mod2);
        sum2 = _mm256_sub_epi64(sum2, mod4);
        sum3 = _mm256_or_si256(sum, sum2);
    } while (!_mm256_testz_si256(sum3, sum3));
#else
    // Just some example values (not really constant, otherwise the compiler would optimize the code away)
    uint64_t cur = 58;
    uint64_t cur2 = 142;
    uint64_t factor = 67;

    if (cur < cur2) {
        std::swap(cur, cur2);
    }
    while (cur < totalX) {
        sieveX[cur >> 6] |= (1ULL << (cur & 0x3f));
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur += factor;
        cur2 += factor;
    }
    while (cur2 < totalX) {
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur2 += factor;
    }
#endif
}

Be warned that the locations can overlap at first. After a short while in the loop, this is no longer the case. I'd be happy to use a different approach if that is possible. Around 82% of the time within this part of the algorithm is spent in this loop. Hopefully this isn't too close to other posted questions.

asked Jul 02 '18 00:07

ChipK

1 Answer

IDK why you use different parts of the same cur[8] array for indices and values; it made the source harder to understand to figure out that there was only one real array. The other was just to bounce vectors to scalars.

It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX[]; I'm not familiar with your sieving algorithm but I guess the point of this is to create data in memory for later use.

AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD. (Like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.

Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions here.

See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.

With AVX, you have 3 main options:

do everything scalar. Register pressure may be a problem, but generating indices as needed (instead of doing all 4 adds or subs to generate cur[4..7] at once) might help. Unless those mask vectors have different values in different elements.

(Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)

Extracting all 4 elements of a vector as scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 on Intel), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.

extract indices and values to scalar like you're doing now, so the OR into sieveX[] is pure scalar. Works even when two or more indices are the same.

This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering for the compiler, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput.) If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want to use alignas(32) cur[8] to make sure actual vector stores don't cross a cache-line boundary.

or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64x(sieveX). So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address¹, you could skip the vpaddq and extract already-scaled indices for the same gain.)

It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-dst or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.

This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.

extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.

Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2).

See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but it took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.

Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.

The former has less port5 pressure, and has better instruction-level parallelism: The two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.

Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused to vmovhps²). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.
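As a rough illustration of the manual gather / SIMD OR / manual scatter option, here is a minimal sketch for one 128-bit half (two elements). The function name `or_pair_manual_gather_scatter` is made up for this example, and the scalar fallback exists only so the snippet builds on non-x86 targets. The SSE2 intrinsics used are the ones that normally map to movq / movhps loads and stores, though the compiler is free to pick other instructions.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2: intrinsics for movq / movhps style loads and stores */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* OR bits[0] into *p0 and bits[1] into *p1 with one vector OR.
 * Precondition (same as for a hardware scatter): p0 and p1 are different words,
 * otherwise the second store would overwrite the first one's bit. */
static void or_pair_manual_gather_scatter(uint64_t *p0, uint64_t *p1,
                                          const uint64_t bits[2])
{
    assert(p0 != p1);
#if HAVE_SSE2
    __m128i b = _mm_loadu_si128((const __m128i *)bits);

    /* Manual gather: movq loads the low qword, movhps merges the high qword. */
    __m128i v = _mm_loadl_epi64((const __m128i *)p0);
    v = _mm_castps_si128(_mm_loadh_pi(_mm_castsi128_ps(v), (const __m64 *)p1));

    v = _mm_or_si128(v, b);                            /* one OR for both words */

    /* Manual scatter: movq store for the low half, movhps store for the high half. */
    _mm_storel_epi64((__m128i *)p0, v);
    _mm_storeh_pi((__m64 *)p1, _mm_castsi128_ps(v));
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    *p0 |= bits[0];
    *p1 |= bits[1];
#endif
}
```

In the real loop the two pointers would come from the extracted index (or pointer) vector, and the second 128-bit half would be handled by an independent copy of the same sequence.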

AVX2 gives us one new option:

AVX2 vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level parallelism hit and gather into a single 256-bit register).

Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not even Broadwell (9 uops, one per 6c throughput) and definitely not Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.
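A sketch of the hardware-gather / manual-scatter variant might look like the following. It assumes GCC or Clang on x86-64 (it uses the `target("avx2")` function attribute and `__builtin_cpu_supports`), and the names `or4_gather_avx2` and `or4_words` are invented for the example. The `word[]` array stands in for the indices you would already have extracted to scalar registers, and the plain scalar loop is only a fallback for machines without AVX2.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define CAN_BUILD_AVX2 1
#else
#define CAN_BUILD_AVX2 0
#endif

#if CAN_BUILD_AVX2
/* GCC/Clang-specific: build only this function with AVX2 enabled. */
__attribute__((target("avx2")))
static void or4_gather_avx2(uint64_t *sieve, const uint64_t word[4],
                            const uint64_t bits[4])
{
    __m256i idx = _mm256_loadu_si256((const __m256i *)word);
    __m256i b   = _mm256_loadu_si256((const __m256i *)bits);

    /* vpgatherqq: loads sieve[word[0..3]]; scale 8 == sizeof(uint64_t). */
    __m256i v = _mm256_i64gather_epi64((const long long *)sieve, idx, 8);
    v = _mm256_or_si256(v, b);

    /* Manual scatter: split into halves, then movq / movhps style stores. */
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    _mm_storel_epi64((__m128i *)&sieve[word[0]], lo);
    _mm_storeh_pi((__m64 *)&sieve[word[1]], _mm_castsi128_ps(lo));
    _mm_storel_epi64((__m128i *)&sieve[word[2]], hi);
    _mm_storeh_pi((__m64 *)&sieve[word[3]], _mm_castsi128_ps(hi));
}
#endif

/* OR bits[i] into sieve[word[i]] for 4 words that must all be distinct. */
static void or4_words(uint64_t *sieve, const uint64_t word[4],
                      const uint64_t bits[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++)
            assert(word[i] != word[j]);   /* a scatter would lose bits otherwise */

#if CAN_BUILD_AVX2
    if (__builtin_cpu_supports("avx2")) {
        or4_gather_avx2(sieve, word, bits);
        return;
    }
#endif
    for (int i = 0; i < 4; i++)           /* scalar fallback */
        sieve[word[i]] |= bits[i];
}
```

Whether this beats the manual gather depends on the microarchitecture, as described above, so it is worth benchmarking both on the target CPU.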

Total cost for each strategy on Skylake

It looks like this won't bottleneck badly on any one port. GP reg->xmm needs port 5, but xmm->int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (e.g. vpextrq rax, xmm0, 1 is a 2 uop instruction, one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)

So your SIMD calculation where you need to frequently extract a vector to scalar is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.
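To make the two extraction styles concrete, here is a small sketch for a single 128-bit vector. `extract_two_ways` is a hypothetical helper, not something from the question. The first half uses ALU extracts (a movq, then a shuffle plus another movq), the second a store followed by scalar reloads. A compiler may well turn one form into the other, so check the generated asm rather than assuming the source form survives.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_X64 1
#else
#define HAVE_SSE2_X64 0
#endif

/* Load a vector from src and get both 64-bit lanes out again, once with
 * ALU extracts (out_alu) and once via store/reload (out_mem). */
static void extract_two_ways(const uint64_t src[2],
                             uint64_t out_alu[2], uint64_t out_mem[2])
{
    assert(src != out_alu && src != out_mem);
#if HAVE_SSE2_X64
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* ALU extract: movq for the low lane, shuffle + movq for the high lane. */
    out_alu[0] = (uint64_t)_mm_cvtsi128_si64(v);
    out_alu[1] = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));

    /* Store/reload: one vector store, then plain scalar loads. */
    uint64_t tmp[2];
    _mm_storeu_si128((__m128i *)tmp, v);
    out_mem[0] = tmp[0];
    out_mem[1] = tmp[1];
#else
    /* No SSE2 available: nothing to extract from, so just copy. */
    out_alu[0] = out_mem[0] = src[0];
    out_alu[1] = out_mem[1] = src[1];
#endif
}
```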

extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.
vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake)
2x vextracti128 (2 uops for port 5)
4x vmovq (4 p0)
4x vpextrq (8: 4p0 4p5)
4x or [r], r (4x2 = 8 front-end uops in total. Back-end: 4p0156 4p23 (load) 4p237 (store-address) 4p4 (store-data)). Non-indexed addressing mode.

Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)
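The address calc in that list is easier to see in scalar form. This sketch models one lane of it: clear the bit-in-word part, shift right by 3 instead of 6 to get a byte offset, and add the base address, giving a pointer that a non-indexed `or [r], r` can use. `word_ptr` and `set_bit` are names made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of the SIMD address calculation:
 * (pos & ~0x3f) >> 3 is the byte offset of the 64-bit word holding bit pos,
 * i.e. (pos >> 6) * 8, and adding the base gives a ready-to-use pointer. */
static uint64_t *word_ptr(uint64_t *sieve, uint64_t pos)
{
    uintptr_t byte_off = (uintptr_t)((pos & ~(uint64_t)0x3f) >> 3);
    uint64_t *p = (uint64_t *)((uintptr_t)sieve + byte_off);
    assert(p == &sieve[pos >> 6]);   /* same word as the indexed form */
    return p;
}

/* Set bit pos in the sieve through the pre-computed pointer. */
static void set_bit(uint64_t *sieve, uint64_t pos)
{
    *word_ptr(sieve, pos) |= (uint64_t)1 << (pos & 0x3f);
}
```

The vector version does the same thing for 4 lanes at once with a mask, a shift and a vpaddq against a broadcast of the base address, which is what the loop in the question already does.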

manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.
(optional, probably not worth it): vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake) Skip these if you could use non-VEX movhps for a 1-uop micro-fused indexed load. (But then p237 stores become p23).
vextracti128 pointers (1 uop for port 5)
2x vmovq extract (2p0)
2x vpextrq (4 = 2p0 2p5)
2x vmovq load (2p23)
2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)
vextracti128 split the data (p5)
2x vpor xmm (2p015)
2x vmovq store (2x 1 micro-fused uop, 2p237 + 2p4)
2x vmovhps store (2x 1 micro-fused uop, 2p237 + 2p4)

Port bottlenecks: 4 p0 and 4 p5 fits comfortably in 5 cycles, especially when you mix this with your loop which can run several of its uops on port 1. On Haswell paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.

Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instructions / smaller code-size and fewer uops in the uop cache. (un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)

Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).
vextracti128 indices (1 uop for port 5)
2x vmovq extract (2p0)
2x vpextrq (4 = 2p0 2p5)
vpcmpeqd ymm0,ymm0,ymm0 create an all-ones mask for vpgatherqq (p015)
vpgatherqq ymm1, [rdi + ymm2*8], ymm0 4 uops for some ports.
vpor ymm (p015)
vextracti128 on the OR result (p5)
2x vmovq store (2x 1 micro-fused uop, 2p23 + 2p4). Note no port7, we're using indexed stores.
2x vmovhps store (2x 1 micro-fused uop, 2p23 + 2p4).

So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.

We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 7 uop 5 instruction vextracti128 / 2x vmovq / 2x vpextrq sequence with a 5 uops store / 4x load.
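For anyone who wants to re-check the totals, this small tally just adds up the per-instruction fused-domain uop counts from the three lists and divides by the 4-uops-per-clock issue width. The struct and helper names are invented for the example and the numbers are copied from the lists, not measured. For the manual gather/scatter case the tally includes the optional address calc, which is one way the listed lines add up to 20.

```c
#include <assert.h>
#include <stddef.h>

struct step {
    const char *insn;   /* instruction as named in the lists above */
    int count;          /* instances per vector of 4 indices */
    int uops_each;      /* fused-domain (front-end) uops per instance */
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

static int total_uops(const struct step *s, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(s[i].count > 0 && s[i].uops_each > 0);
        total += s[i].count * s[i].uops_each;
    }
    return total;
}

/* Front-end cycles at an issue width of 4 fused-domain uops per clock. */
static double frontend_cycles(int uops) { return uops / 4.0; }

static const struct step scalar_or[] = {
    {"vpaddq + vpand address calc", 2, 1}, {"vextracti128", 2, 1},
    {"vmovq", 4, 1}, {"vpextrq", 4, 2}, {"or [r], r", 4, 2},
};

static const struct step manual_gather[] = {
    {"vpaddq + vpand address calc (optional)", 2, 1},
    {"vextracti128 pointers", 1, 1}, {"vmovq extract", 2, 1},
    {"vpextrq", 2, 2}, {"vmovq load", 2, 1}, {"vmovhps load", 2, 1},
    {"vextracti128 data", 1, 1}, {"vpor xmm", 2, 1},
    {"vmovq store", 2, 1}, {"vmovhps store", 2, 1},
};

static const struct step hw_gather[] = {
    {"vextracti128 indices", 1, 1}, {"vmovq extract", 2, 1},
    {"vpextrq", 2, 2}, {"vpcmpeqd mask", 1, 1}, {"vpgatherqq", 1, 4},
    {"vpor ymm", 1, 1}, {"vextracti128 result", 1, 1},
    {"vmovq store", 2, 1}, {"vmovhps store", 2, 1},
};
```

Calling `total_uops(scalar_or, COUNT(scalar_or))` and friends gives 24, 20 and 18 uops, i.e. 6, 5 and 4.5 front-end cycles per vector of 4 indices.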

Overall: One loop until we're done with conflicts, then a SIMD gather loop

You say that after a certain point, you don't have conflicts (overlap) between the indices like cur[0] == cur[2].

You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX512, Skylake's vpconflictq is micro-code and not fast. (KNL has single-uop vpconflictq but it's still faster to avoid it entirely).

I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.

You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.

There's not a huge speedup from the best scalar-or version to the best gather version (6 cycles vs. 4.5 isn't accounting for the other work in the loop, so the ratio is even smaller than that). Leaving the slightly slower version ASAP is not worth making it a lot slower.

So if you can reliably detect when you're done with conflicts, use something like

int conflictcheck = 10;

do {
    if (--conflictcheck == 0) {
        // vector stuff to check for conflicts
        if (no conflicts now or in the future)
            break;

        conflictcheck = 10;  // reset the down-counter
    }

    // main loop body, extract -> scalar OR strategy

} while (blah);

// then fall into the gather/scatter loop.
do {
    // main loop body, gather + manual scatter strategy
} while (blah);

That should compile to a dec / je which only costs 1 uop in the not-taken case.

Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.
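Here is a runnable scalar sketch of that two-loop structure, under some loudly stated assumptions. `lanes_stay_apart` is a hypothetical "no conflicts now or in the future" test that is only sufficient, not necessary: it requires every pair of lanes to be at least 64 bits apart with the lane in front having the larger (or equal) stride, so the gap can only grow. Your real algorithm probably needs a different criterion. The second loop models gather + scatter by loading all four words before storing any of them, which is exactly the pattern that would lose bits if two lanes shared a word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LANES 4
#define CHECK_PERIOD 10

static void set_bit(uint64_t *sieve, uint64_t pos)
{
    sieve[pos >> 6] |= (uint64_t)1 << (pos & 0x3f);
}

static bool all_below(const uint64_t pos[LANES], uint64_t limit)
{
    for (int i = 0; i < LANES; i++)
        if (pos[i] >= limit)
            return false;
    return true;
}

/* Hypothetical sufficient test: gaps >= 64 bits that can never shrink. */
static bool lanes_stay_apart(const uint64_t pos[LANES],
                             const uint64_t stride[LANES])
{
    for (int i = 0; i < LANES; i++) {
        for (int j = i + 1; j < LANES; j++) {
            int lo = pos[i] <= pos[j] ? i : j;
            int hi = (lo == i) ? j : i;
            if (pos[hi] - pos[lo] < 64 || stride[hi] < stride[lo])
                return false;
        }
    }
    return true;
}

/* Returns how many iterations ran in the conflict-tolerant first loop. */
static unsigned sieve_two_phase(uint64_t *sieve, uint64_t limit,
                                uint64_t pos[LANES],
                                const uint64_t stride[LANES])
{
    unsigned slow_iters = 0;
    int conflictcheck = CHECK_PERIOD;

    for (int i = 0; i < LANES; i++)
        assert(stride[i] > 0);

    /* Loop 1: lanes may share a word, so OR them in one at a time. */
    while (all_below(pos, limit)) {
        if (--conflictcheck == 0) {
            if (lanes_stay_apart(pos, stride))
                break;
            conflictcheck = CHECK_PERIOD;   /* reset the down-counter */
        }
        for (int i = 0; i < LANES; i++) {
            set_bit(sieve, pos[i]);
            pos[i] += stride[i];
        }
        slow_iters++;
    }

    /* Loop 2: words are distinct; stand-in for gather + OR + scatter. */
    while (all_below(pos, limit)) {
        uint64_t w[LANES];
        for (int i = 0; i < LANES; i++)     /* "gather" and OR */
            w[i] = sieve[pos[i] >> 6] | ((uint64_t)1 << (pos[i] & 0x3f));
        for (int i = 0; i < LANES; i++) {   /* "scatter" */
            sieve[pos[i] >> 6] = w[i];
            pos[i] += stride[i];
        }
    }

    /* Cleanup: finish whichever lanes are still below the limit. */
    for (int i = 0; i < LANES; i++)
        for (; pos[i] < limit; pos[i] += stride[i])
            set_bit(sieve, pos[i]);

    return slow_iters;
}
```

With start positions 1, 3, 5, 7 and strides 67, 71, 73, 79 over 4096 bits, the check first succeeds after 39 iterations of the first loop, so at most 9 of those were "wasted" relative to checking every iteration.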

Footnote 1:

If sieveX is static and you're building non-PIC code on Linux (not MacOS) then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.

This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.

Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.

IACA (Intel's static analysis tool) gets this wrong, though. :( What is IACA and how do I use it?.

This is basically a missed-optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?.

The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf to count uops, with this loop:

    mov     r15d, 100000000

.loop:
    vpaddq  ymm0, ymm1, ymm2      ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2      ; dirty another register for good measure

    vmovq  xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                          ; fast on Skylake, will stall on Haswell

    dec r15d
    jnz .loop

The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock. 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.

Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration, now 6 fused-domain, but still the same 6 unfused-domain uops.

There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0 so it's not quite movlps).

Also note that pinsrq xmm0, [mem], 1 can't micro-fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.

Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though; I didn't check.

answered Sep 18 '22 20:09

Peter Cordes

Tags: performance, algorithm, optimization, simd, avx2