AVX2: U8 absolute difference

Question

I'm new to AVX (came from ARM NEON), and was unpleasantly surprised that AVX lacks many U8 arithmetics, absolute difference being among them missing.

Hence I had to resort to max(a,b)-min(a,b) with the inline function:

static inline __m256i _mm256_abd_epu8(__m256i a, __m256i b)
{
    return _mm256_sub_epi8(_mm256_max_epu8(a, b), _mm256_min_epu8(a, b));
}

I'm curious if there are more efficient ways dealing with this problem.

And yes, I'm aware of _mm256_sad_epu8, but I need the differences themselves, NOT the sum.

I'd appreciate any input, and it's ok with AVX2, disregarding any backward compatibility.

Thanks in advance.

Peter Cordes · Accepted Answer

I'm not aware of any trick for doing this with only 2 or fewer instructions. (And the SSE version of this question doesn't have anything better either: Compute the absolute difference between unsigned integers using SSE). It does mention the saturating method I used in this answer.

Slightly better on pre-Skylake: subtract both ways with unsigned saturation, then OR the results. (Either a-b or b-a saturates to zero for each element.)

_mm256_or_si256(_mm256_subs_epu8(a,b), _mm256_subs_epu8(b,a))

On Haswell, pmin/pmax and psub only run on port 1 or port 5, but por can run on any of the three vector execution ports (0, 1, 5).

Skylake adds a 3rd vector-integer adder so there's no difference on that uarch. (See http://agner.org/optimize/ and other links in the x86 tag wiki, including Intel's optimization manual.)

This is also slightly better on Ryzen, where VPOR can run on any of P0123, but PADD/PMIN can only run on P013 according to Agner Fog's testing. (Ryzen splits 256b vector ops into 2 uops, but it has the throughput for that to be useful. It can't fill its 6-uop wide pipe using only single-uop instructions.)

Uops that can run on more ports are less likely to be delayed waiting for their assigned port (resource conflict), so you're more likely to actually get 2 cycle total latency with this (from both inputs being ready to the output being ready). They're also less likely to contribute to a throughput bottleneck if there's competition for a specific port (like port 5 which has the only shuffle unit on Intel Haswell and later).

AVX2: U8 absolute difference

Tags:

avx

simd

sse

neon

avx2

Jake 'Alquimista' LEE

1 Answers

Peter Cordes

Recent Activity

Donate For Us

AVX2: U8 absolute difference

Tags:

avx

simd

sse

neon

avx2

Jake 'Alquimista' LEE

1 Answers

Peter Cordes

Related questions

Recent Activity

Donate For Us