SIMD search for trough after the last peak

I need to find the index of the first value that is X% or more below the last rolling-maximum peak.

The peak is a rolling maximum of the elements in one array (highs), while the values are in another array (lows). The arrays have equal length, every value is guaranteed to be <= the corresponding item in the highs (peaks) array, and there are no 0, NaN, or infinity elements. since is guaranteed to be less than till.
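
For example (numbers invented purely for illustration): with trail = 0.1 (i.e. 10%), highs = {10, 11, 12, 11, 11} and lows = {9.5, 10.5, 11, 10.9, 10.7}, the rolling maximum reaches 12 at index 2, and lows[4] = 10.7 is the first value at or below 0.9 * 12 = 10.8, so the search should return 4.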

Iterative implementation is straightforward:

inline
size_t trail_max_intern(double *highs,
        double *lows,
        double max,
        double trail,
        size_t since,
        size_t till)
{
    for (; since < till; ++since) {
        if (max < highs[since]) {
            max = highs[since];
        }

        if (lows[since] / max <= trail) {
            break;
        }
    }

    return since;
}

size_t trail_max_iter(double *highs, double *lows, double trail, size_t since, size_t till)
{
    trail = 1 - trail;
    return trail_max_intern(highs, lows, highs[since], trail, since, till);
}

This is basically a linear search with a somewhat modified condition. Because of the task context (arbitrary since, till, and trail values), no other structure or algorithm can be used.

To get some speedup, I thought I'd use the AVX2 vector extensions and see what happens. My result looks like this:

size_t trail_max_vec(double *highs, double *lows, double trail, size_t since, size_t till)
{
    double max = highs[since];
    trail = 1 - trail;

    if (till - since > 4) {
        __m256d maxv = _mm256_set1_pd(max);
        __m256d trailv = _mm256_set1_pd(trail);

        // stop while a full 4-element load stays in bounds, even if since isn't a multiple of 4
        for (size_t last = till - 3; since < last; since += 4) {
            __m256d chunk = _mm256_loadu_pd(highs + since); // load peak block

            // peak rolling maximum computation
            maxv = _mm256_max_pd(maxv, chunk);
            // propagating the maximum value to places 2, 3, and 4
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b10010000));
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b01000000));
            maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b00000000));

            // divide lows by rolling maximum
            __m256d res = _mm256_div_pd(_mm256_loadu_pd(lows + since), maxv);
            // and if it is lower than the given fraction return its index
            int found = _mm256_movemask_pd(_mm256_cmp_pd(res, trailv, _CMP_LE_OQ));
            if (found) {
                return since + __builtin_ctz(found);
            }

            maxv = _mm256_set1_pd(maxv[3]);
        }

        max = maxv[3];
    }

    // make sure trailing elements are seen
    return trail_max_intern(highs, lows, max, trail, since, till);
}

It yields correct results, but it is ~2 times slower than the iterative version. I'm definitely doing something wrong, but I can't figure out what.

So my question is, what is wrong with my approach and how do I fix it?

P. S. Complete source with benchmarks is available at https://godbolt.org/z/e5YrTo

asked Jan 25 '23 by Daniel

2 Answers

This creates a very long dependency chain:

        // peak rolling maximum computation
        maxv = _mm256_max_pd(maxv, chunk);
        // propagating the maximum value to places 2, 3, and 4
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b10010000));
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b01000000));
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, 0b00000000));

        // ...

        maxv = _mm256_set1_pd(maxv[3]);

i.e., you have 8 instructions all depending on the result of the previous instruction. Assuming each has a latency of 3 cycles, this alone requires 24 cycles for 4 elements (the remaining operations can likely happen within that time, especially if you compare lows[since] <= trail * max -- assuming max > 0).
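
As a minimal sketch (reusing maxv and trailv from the question's loop; valid because all values are positive per the stated guarantees), the division could be replaced like this:

            // compare lows[since] <= trail * max instead of lows[since] / max <= trail
            __m256d lo      = _mm256_loadu_pd(lows + since);
            __m256d threshv = _mm256_mul_pd(maxv, trailv);   // per-lane trail * rolling max
            int found = _mm256_movemask_pd(_mm256_cmp_pd(lo, threshv, _CMP_LE_OQ));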

To reduce the dependency chain, you should first compute the "local" rolling maximum within chunk and compute the maximum of that with maxv afterwards:

        chunk = _mm256_max_pd(chunk, _mm256_movedup_pd(chunk));                  // [c0, c0c1, c2, c2c3]
        chunk = _mm256_max_pd(chunk, _mm256_permute4x64_pd(chunk, 0b01010100));  // [c0, c0c1, c0c1c2, c0c1c2c3]

        __m256d max_local = _mm256_max_pd(maxv, chunk);

To compute the next maxv you can either broadcast max_local[3] (giving a total latency of ~6 cycles), or first broadcast chunk[3] and compute the maximum of that with maxv (this would leave just one maxpd on the loop-carried dependency chain, so you would clearly be limited by throughput). Benchmarking this on Godbolt produced too much noise to decide which variant is better in your case.
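
Sketched with the names from above, the two alternatives would look like this (which one wins depends on the surrounding code):

        // Variant 1: broadcast the combined maximum (adds ~2 shuffle/max steps
        // to the loop-carried chain)
        maxv = _mm256_permute4x64_pd(max_local, _MM_SHUFFLE(3, 3, 3, 3));

        // Variant 2: broadcast only chunk's maximum and fold it in, leaving a
        // single vmaxpd on the loop-carried dependency chain
        maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(chunk, _MM_SHUFFLE(3, 3, 3, 3)));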

Furthermore, you could consider working on larger chunks (e.g., load two consecutive __m256d vectors, compute a local rolling maximum across both, and so on), as sketched below.
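
A rough sketch of that idea (names invented here; the compares and the final maxv broadcast would follow the same pattern as before):

        __m256d a = _mm256_loadu_pd(highs + since);
        __m256d b = _mm256_loadu_pd(highs + since + 4);

        // the two local rolling-max scans are independent, so their latency chains overlap
        a = _mm256_max_pd(a, _mm256_movedup_pd(a));
        a = _mm256_max_pd(a, _mm256_permute4x64_pd(a, 0b01010100));
        b = _mm256_max_pd(b, _mm256_movedup_pd(b));
        b = _mm256_max_pd(b, _mm256_permute4x64_pd(b, 0b01010100));

        __m256d a_top = _mm256_permute4x64_pd(a, _MM_SHUFFLE(3, 3, 3, 3)); // max of a's 4 elements
        __m256d max_a = _mm256_max_pd(maxv, a);                            // rolling max, elements 0..3
        __m256d max_b = _mm256_max_pd(maxv, _mm256_max_pd(b, a_top));      // rolling max, elements 4..7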

answered Feb 03 '23 by chtz


Especially in the scalar version, I'd calculate max * trail once when finding a new max; then the other condition is just lows[since] <= threshold, involving no computation at all.
This is even better than just replacing the division with a multiplication by a precomputed reciprocal, e.g. lows[since] * (1./trail) (suggested in comments). Division is higher latency, and can be a bottleneck if you exceed the divider unit's limited throughput (which is worst for 256-bit double vectors).
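
A minimal sketch of that scalar rewrite (hypothetical function name; same signature and semantics as the question's trail_max_iter):

size_t trail_max_thresh(double *highs, double *lows, double trail,
                        size_t since, size_t till)
{
    double max = highs[since];
    double thresh = max * (1 - trail);      // recomputed only when max changes
    for (; since < till; ++since) {
        if (highs[since] > max) {
            max = highs[since];
            thresh = max * (1 - trail);
        }
        if (lows[since] <= thresh)          // no division, usually no multiply
            break;
    }
    return since;
}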

OTOH, GCC is optimizing your current if to a maxsd, so new-max is branchless. Another alternative is if (lows[since] * (1./trail) <= max) to trick the compiler into using maxsd and mulsd every iteration, instead of conditionally skipping the multiply in if (lows[since] <= max * trail). But for your test microbenchmark on Godbolt, using a threshold seems best even though that means more branching. (But less work per iteration in the common case, although GCC isn't unrolling and it's a small test whose data might be too "easy".)


In branchless SIMD that's not helpful¹ because you'd have to do all the work every time anyway, but if a new max is rare-ish (e.g. highs[] large and uniformly distributed), it could be worth branching on that, in order to do much less shuffling and maxing work in the common case. Especially if you're tuning to run on a CPU with hyperthreading / SMT: the other logical core's uops can keep the core busy while it recovers from a branch mispredict. (If new maxes are very common, though, like if highs is on average increasing rather than uniform, this will be bad.)

¹ Assuming you're already following @chtz's answer, which shows how to keep most of that work off the critical path (the loop-carried dependency chain that creates the latency bottleneck).


The common / fast case will be blazing through vectors just checking two conditions (highs > max or lows <= thresh). If either of those is true, you run the "expensive" code to do the correct scan order within a vector, and then check thresholds. You do still need to correctly handle the case where highs[1] is a new max and lows[1..3] are not below that new threshold, even if they were below the old max. And also the case where lows[0] is below the old threshold, but not below the new threshold. And of course multiple new maxes in one vector are possible.

        __m256d maxv = _mm256_set1_pd(max);
        __m256d thresh = _mm256_mul_pd(maxv, _mm256_set1_pd(trail));
        for (; since + 4 <= till; since += 4) {   // keep the 4-wide loads in bounds; scalar cleanup does the tail
            __m256d vhi = _mm256_loadu_pd(highs + since);
            __m256d vlo = _mm256_loadu_pd(lows + since);
    
            __m256d vtrough = _mm256_cmp_pd(vlo, thresh, _CMP_LE_OQ);  // possible candidate
            __m256d newmax = _mm256_cmp_pd(vhi, maxv, _CMP_GT_OQ);     // definite new max

            // __m256d special_case = _mm256_or_pd(vtrough, newmax);
            // The special case needs trough separately, and it's only slightly more expensive to movmskpd twice.
            // AVX512 could use kortest between two compare results.

            unsigned trough = _mm256_movemask_pd(vtrough);
            // add/jnz can macro-fuse on Intel, unlike or/jnz, so use that to check for either one being non-zero.  AMD Zen doesn't care.
            if (trough + _mm256_movemask_pd(newmax)) {
                if (trough) {
                    // ... full work to verify or reject the candidate, hopefully rare.
                    // Perhaps extract this to a helper function
                    int found = ...;
                    // This becomes trivial (found = trough) in the no-new-max case, but that may be rare enough not to be worth special-casing here.
                    if (found) {
                        return since + __builtin_ctz(found);
                    }
                    maxv = _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(3,3,3,3));
                } else {
                    // just a new max, no possible trough even with the lowest threshold
                    // horizontal max-broadcast of the current chunk, replacing old maxv (no data dependency on it)
                    maxv = _mm256_max_pd(vhi, _mm256_shuffle_pd(vhi,vhi, 0b0101));   // in-lane swap pairs so both elements of each 128-bit lane hold that lane's max
                    maxv = _mm256_max_pd(maxv, _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(1,0,3,2)));    // swap low and high halves.  vperm2f128 would also work, but slower on Zen1.
                }

                thresh = _mm256_mul_pd(maxv, _mm256_set1_pd(trail));
            }
        }

Probably not worth further special-casing the no-new-max case to just use the current maxv (and thus the current trough you already computed), unless you expect finding a trough to be at least as common as a new max. (If this were a single array that changed gradually, rather than separate high and low arrays, that would be common: it would be rare for a single vector to include both a new max and a trough.)

maxv = _mm256_set1_pd(maxv[3]); probably compiles efficiently, but some compilers are dumber than others when inventing shuffles to implement set1. Also, vector[] is specific to how GNU C implementations (gcc/clang) define __m256d, and isn't portable.
Use maxv = _mm256_permute4x64_pd(maxv, _MM_SHUFFLE(3,3,3,3)); to broadcast the high element with a vpermpd ymm, ymm, imm8 instruction.

Just to see if it compiled, and to get a rough idea of speed, I put it on Godbolt (with int found = trough;, which mis-handles the case where there's both a new max and a possible trough candidate).
On Godbolt's Skylake-server CPUs, benchmark results are obviously noisy from load on AWS, but the vectorized version varied from 33 ms to 55 ms for 100 reps, vs. ~95 ms for the scalar version. Before optimizing the scalar version to use lows[since] <= thresh (updating thresh = max * trail; in the new-max case), times for scalar bounced around from ~120 to 127 ms. (Probably CPU frequency and other warm-up effects make the first version tested much more variable over such a short timed interval.)

I used -march=haswell instead of -mavx2 because GCC's default tune settings are poor for modern CPUs with 256-bit possibly-unaligned loads (good for Sandybridge, not Haswell or Zen 2): Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?.

The actual asm for the loop is efficient like I expected:

# gcc -O3 -march=haswell, inner loop
        vmovupd ymm0, YMMWORD PTR [rdi+rdx*8]
        vmovupd ymm7, YMMWORD PTR [rsi+rdx*8]
        vcmppd  ymm3, ymm0, ymm2, 30
        vcmppd  ymm1, ymm7, ymm4, 18
        vmovmskpd       ecx, ymm3
        vmovmskpd       r8d, ymm1
        add     ecx, r8d                # destroys movemask(newmax) and sets FLAGS
        je      .L16

        test    r8d, r8d
        jne     .L36                    # jump if trough, else fall through to new max.  In this hacky version, finding non-zero trough always exits the loop because it ignores the possibility of also being a new max.  The asm will *probably* look similar once handled properly, though.

        vshufpd ymm1, ymm0, ymm0, 5
        vmaxpd  ymm0, ymm0, ymm1
        vpermpd ymm2, ymm0, 78
        vmaxpd  ymm2, ymm0, ymm2       # new max broadcast into YMM2
        vmulpd  ymm4, ymm2, ymm6       # update thresh from max

.L16:
        add     rdx, 4
        cmp     r9, rdx
        ja      .L19

Note that the common case (just comparing) has no loop-carried data dependency (except the pointer increment).

The new-max case also has no data dependency on the old max, only a control dependency (which branch prediction + speculative execution can speculate through, if we're seeing a predictable run of vectors with a new max but not a trough candidate.)

So the complex case code that handles a trough candidate doesn't have to worry too much about the length of the data dependencies, although the sooner a new max is ready the sooner out-of-order exec can start checking branch conditions on later iterations.


Also note that you can safely handle the last few elements with SIMD if you want, because it's safe to redo elements you've already looked at. max is monotonically increasing (and thus so is the threshold), so you'll never accept a lows[] that you shouldn't have. So you can do one final vector that ends at the last element of your range, as long as the total size is >= 4 so you don't read outside the array.

In this case it may not be worth it (because each SIMD iteration is so much more expensive than a scalar one), except maybe for a quick check to see whether there's a possible new max and/or low candidate before running the scalar cleanup, as sketched below.
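
A hedged sketch of that quick pre-cleanup check (reusing maxv and thresh from the loop above; it needs till >= 4 so the overlapping load stays in bounds):

    // One overlapping vector ending exactly at till: if it contains neither a
    // new max nor a value at/below the current threshold, the remaining tail
    // can't contain a trough, so the scalar cleanup can be skipped entirely.
    if (since < till && till >= 4) {
        __m256d vhi = _mm256_loadu_pd(highs + till - 4);
        __m256d vlo = _mm256_loadu_pd(lows + till - 4);
        int any = _mm256_movemask_pd(_mm256_cmp_pd(vhi, maxv, _CMP_GT_OQ))
                | _mm256_movemask_pd(_mm256_cmp_pd(vlo, thresh, _CMP_LE_OQ));
        if (!any)
            return till;    // no candidate anywhere in since..till-1
        // otherwise fall through to the scalar loop
    }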

answered Feb 03 '23 by Peter Cordes