I have 8 AVX vectors containing 8 floats each (64 floats in total) and I want to sum the elements within each vector (i.e., perform eight horizontal sums).
For now, I'm using the following code:
__m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
    // transpose
    const __m256 t0 = _mm256_unpacklo_ps(v0, v1);
    const __m256 t1 = _mm256_unpackhi_ps(v0, v1);
    const __m256 t2 = _mm256_unpacklo_ps(v2, v3);
    const __m256 t3 = _mm256_unpackhi_ps(v2, v3);
    const __m256 t4 = _mm256_unpacklo_ps(v4, v5);
    const __m256 t5 = _mm256_unpackhi_ps(v4, v5);
    const __m256 t6 = _mm256_unpacklo_ps(v6, v7);
    const __m256 t7 = _mm256_unpackhi_ps(v6, v7);

    __m256 v = _mm256_shuffle_ps(t0, t2, 0x4E);
    const __m256 tt0 = _mm256_blend_ps(t0, v, 0xCC);
    const __m256 tt1 = _mm256_blend_ps(t2, v, 0x33);
    v = _mm256_shuffle_ps(t1, t3, 0x4E);
    const __m256 tt2 = _mm256_blend_ps(t1, v, 0xCC);
    const __m256 tt3 = _mm256_blend_ps(t3, v, 0x33);
    v = _mm256_shuffle_ps(t4, t6, 0x4E);
    const __m256 tt4 = _mm256_blend_ps(t4, v, 0xCC);
    const __m256 tt5 = _mm256_blend_ps(t6, v, 0x33);
    v = _mm256_shuffle_ps(t5, t7, 0x4E);
    const __m256 tt6 = _mm256_blend_ps(t5, v, 0xCC);
    const __m256 tt7 = _mm256_blend_ps(t7, v, 0x33);

    // compute sums
    __m256 sum0 = _mm256_add_ps(_mm256_add_ps(tt0, tt1), _mm256_add_ps(tt2, tt3));
    __m256 sum1 = _mm256_add_ps(_mm256_add_ps(tt4, tt5), _mm256_add_ps(tt6, tt7));

    // final inter-lane shuffling
    v0 = _mm256_blend_ps(sum0, sum1, 0xF0);
    v1 = _mm256_permute2f128_ps(sum0, sum1, 0x21);
    return _mm256_add_ps(v0, v1);
}
As you can see, I'm just transposing the vectors and summing the columns at the end. I'm already using two tricks here: replacing _mm256_shuffle_ps with _mm256_blend_ps where possible to reduce port 5 pressure on Intel CPUs, and using _mm256_permute2f128_ps + _mm256_blend_ps at the end to perform the inter-lane shuffling.
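For clarity, this is what the function is supposed to compute, written as a plain scalar sketch (HorizontalSums_scalar is just a name for illustration, not my real code):

#include <immintrin.h>

// Scalar reference: out[i] is the sum of the eight floats in input vector i.
// The SIMD version matches this up to floating-point rounding differences
// caused by the different summation order.
__m256 HorizontalSums_scalar(const __m256 in[8])
{
    float out[8];
    for (int i = 0; i < 8; ++i)
    {
        float lanes[8];
        _mm256_storeu_ps(lanes, in[i]);
        out[i] = 0.0f;
        for (int j = 0; j < 8; ++j)
            out[i] += lanes[j];
    }
    return _mm256_loadu_ps(out);
}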
Is there any better (faster) way to compute this?
OK, I think I have found a faster algorithm based on (usually slow) HADDs:
__m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
    const __m256 s01 = _mm256_hadd_ps(v0, v1);
    const __m256 s23 = _mm256_hadd_ps(v2, v3);
    const __m256 s45 = _mm256_hadd_ps(v4, v5);
    const __m256 s67 = _mm256_hadd_ps(v6, v7);
    const __m256 s0123 = _mm256_hadd_ps(s01, s23);
    const __m256 s4567 = _mm256_hadd_ps(s45, s67);
    // inter-lane shuffle
    v0 = _mm256_blend_ps(s0123, s4567, 0xF0);
    v1 = _mm256_permute2f128_ps(s0123, s4567, 0x21);
    return _mm256_add_ps(v0, v1);
}
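For reference, _mm256_hadd_ps adds adjacent pairs within each 128-bit lane, placing the first operand's pair sums in the lower half of each lane and the second operand's in the upper half. After the two hadd levels, each lane of s0123/s4567 therefore holds per-vector partial sums of the low and high 128-bit halves, and the final blend + permute2f128 + add combines them. A standalone demo of the per-lane semantics (an assumed example, not from the original post):

#include <immintrin.h>
#include <cstdio>

int main()
{
    const __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
    float r[8];
    _mm256_storeu_ps(r, _mm256_hadd_ps(a, b));
    // Prints: 1 5 21 25 9 13 29 33
    // i.e. { a0+a1, a2+a3, b0+b1, b2+b3 | a4+a5, a6+a7, b4+b5, b6+b7 }
    for (int i = 0; i < 8; ++i)
        printf("%g ", r[i]);
    printf("\n");
}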
According to IACA, it's ~8 cycles faster on Haswell.
Witek902's solution should work well, but it may suffer from high port 5 pressure if HorizontalSums is called very often by the surrounding code. On Intel Haswell and newer, the vhaddps instruction decodes to 3 micro-ops: 2 for port 5 (p5) and one for p1 or p01 (see Agner Fog's instruction tables). The function sort_of_alternative_hadd_ps below also decodes to 3 micro-ops, but only one of them (the shuffle) necessarily executes on p5:
inline __m256 sort_of_alternative_hadd_ps(__m256 x, __m256 y)
{
    __m256 y_hi_x_lo = _mm256_blend_ps(x, y, 0b11001100);   /* y7 y6 x5 x4 y3 y2 x1 x0 */
    __m256 y_lo_x_hi = _mm256_shuffle_ps(x, y, 0b01001110); /* y5 y4 x7 x6 y1 y0 x3 x2 */
    return _mm256_add_ps(y_hi_x_lo, y_lo_x_hi);
}
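Note that sort_of_alternative_hadd_ps is not an exact drop-in for _mm256_hadd_ps: it pairs x0 with x2 instead of x0 with x1, and so on (hence the name). That doesn't matter here, because each input element is still consumed exactly once and lands in a position owned by the same input vector, so the final per-vector sums come out identical. A minimal check (an assumed test harness, not part of the original answer):

#include <immintrin.h>
#include <cstdio>

// sort_of_alternative_hadd_ps as defined above
inline __m256 sort_of_alternative_hadd_ps(__m256 x, __m256 y)
{
    __m256 y_hi_x_lo = _mm256_blend_ps(x, y, 0b11001100);
    __m256 y_lo_x_hi = _mm256_shuffle_ps(x, y, 0b01001110);
    return _mm256_add_ps(y_hi_x_lo, y_lo_x_hi);
}

int main()
{
    const __m256 a = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256 b = _mm256_setr_ps(10, 11, 12, 13, 14, 15, 16, 17);
    float r[8];
    _mm256_storeu_ps(r, sort_of_alternative_hadd_ps(a, b));
    // Prints: 2 4 22 24 10 12 30 32
    // (pairs a0+a2, a1+a3, b0+b2, b1+b3, a4+a6, a5+a7, b4+b6, b5+b7;
    //  vhaddps would pair adjacent elements instead)
    for (int i = 0; i < 8; ++i)
        printf("%g ", r[i]);
    printf("\n");
}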
It is possible to replace the first four _mm256_hadd_ps() intrinsics in Witek902's answer with the sort_of_alternative_hadd_ps function. Since each replacement turns one instruction into three, this costs 8 extra instructions altogether for the horizontal sum:
__m256 HorizontalSums_less_p5_pressure(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7)
{
    __m256 s01 = sort_of_alternative_hadd_ps(v0, v1);
    __m256 s23 = sort_of_alternative_hadd_ps(v2, v3);
    __m256 s45 = sort_of_alternative_hadd_ps(v4, v5);
    __m256 s67 = sort_of_alternative_hadd_ps(v6, v7);
    __m256 s0123 = _mm256_hadd_ps(s01, s23);
    __m256 s4567 = _mm256_hadd_ps(s45, s67);
    v0 = _mm256_blend_ps(s0123, s4567, 0xF0);
    v1 = _mm256_permute2f128_ps(s0123, s4567, 0x21);
    return _mm256_add_ps(v0, v1);
}
This compiles to:
HorizontalSums_less_p5_pressure:
        vblendps ymm8, ymm0, ymm1, 204
        vblendps ymm10, ymm2, ymm3, 204
        vshufps ymm0, ymm0, ymm1, 78
        vblendps ymm9, ymm4, ymm5, 204
        vblendps ymm1, ymm6, ymm7, 204
        vshufps ymm2, ymm2, ymm3, 78
        vshufps ymm4, ymm4, ymm5, 78
        vshufps ymm6, ymm6, ymm7, 78
        vaddps ymm0, ymm8, ymm0
        vaddps ymm6, ymm6, ymm1
        vaddps ymm2, ymm10, ymm2
        vaddps ymm4, ymm9, ymm4
        vhaddps ymm0, ymm0, ymm2
        vhaddps ymm4, ymm4, ymm6
        vblendps ymm1, ymm0, ymm4, 240
        vperm2f128 ymm0, ymm0, ymm4, 33
        vaddps ymm0, ymm1, ymm0
        ret
In the end, both Witek902's HorizontalSums and HorizontalSums_less_p5_pressure are decoded by the CPU into 21 micro-ops, with 13 and 9 p5 micro-ops, respectively. Depending on the surrounding code and the actual microarchitecture, this reduced port 5 pressure may improve the performance.
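If you want to check this on your own workload, here is a minimal latency micro-benchmark sketch (an assumed harness with hypothetical constants; the micro-op counts above come from Agner Fog's tables and IACA, not from this loop). Feeding the result back into the first argument creates a loop-carried dependency, so this measures the latency of the whole chain rather than throughput:

#include <immintrin.h>
#include <chrono>
#include <cstdio>

// Assumes HorizontalSums_less_p5_pressure and its helper from above
// are in scope.

int main()
{
    // All-zero inputs keep the accumulator bounded: r[0] settles at 8.0
    // (the horizontal sum of the initial all-ones accumulator).
    const __m256 z = _mm256_setzero_ps();
    __m256 acc = _mm256_set1_ps(1.0f);

    const int iters = 100000000;  // hypothetical iteration count
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        acc = HorizontalSums_less_p5_pressure(acc, z, z, z, z, z, z, z);
    const auto t1 = std::chrono::steady_clock::now();

    float r[8];
    _mm256_storeu_ps(r, acc);  // keep the result live so the loop isn't elided
    printf("r[0] = %g, elapsed = %.3f s\n", r[0],
           std::chrono::duration<double>(t1 - t0).count());
}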