
AVX2: U8 absolute difference

Tags: avx, simd, sse, neon, avx2

I'm new to AVX (coming from ARM NEON), and was unpleasantly surprised that AVX lacks many U8 arithmetic operations, absolute difference among them.

Hence I had to resort to max(a,b)-min(a,b) with the inline function:

static inline __m256i _mm256_abd_epu8(__m256i a, __m256i b)
{
    return _mm256_sub_epi8(_mm256_max_epu8(a, b), _mm256_min_epu8(a, b));
}

I'm curious whether there are more efficient ways of dealing with this problem.

And yes, I'm aware of _mm256_sad_epu8, but I need the differences themselves, NOT the sum.

I'd appreciate any input. AVX2-only solutions are fine; I don't need backward compatibility.

Thanks in advance.

Jake 'Alquimista' LEE asked Oct 27 '25 22:10


1 Answer

I'm not aware of any trick for doing this in two or fewer instructions. The SSE version of this question (Compute the absolute difference between unsigned integers using SSE) doesn't have anything better either, although it does mention the saturating method I used in this answer.


Slightly better on pre-Skylake: subtract both ways with unsigned saturation, then OR the results. (Either a-b or b-a saturates to zero for each element.)

_mm256_or_si256(_mm256_subs_epu8(a,b), _mm256_subs_epu8(b,a))
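To see why the OR works, here is a rough scalar sketch that models a single u8 lane and compares it against max - min for every input pair. The helper names (subs_u8, abd_u8_model, abd_u8_maxmin, verify_abd_model) are made up for illustration and are not intrinsics; subs_u8 only assumes the documented behaviour of unsigned saturating subtraction (results below zero clamp to zero).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one u8 lane of _mm256_subs_epu8:
   a - b, clamped to 0 when b >= a. */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* One lane of the OR-of-saturating-subtractions version.
   At least one of the two operands of | is always 0. */
static uint8_t abd_u8_model(uint8_t a, uint8_t b)
{
    return (uint8_t)(subs_u8(a, b) | subs_u8(b, a));
}

/* One lane of the max(a,b) - min(a,b) version from the question. */
static uint8_t abd_u8_maxmin(uint8_t a, uint8_t b)
{
    uint8_t hi = a > b ? a : b;
    uint8_t lo = a > b ? b : a;
    return (uint8_t)(hi - lo);
}

/* Asserts that both versions agree on all 256 * 256 input pairs.
   Returns the number of pairs checked. */
static int verify_abd_model(void)
{
    int checked = 0;
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            assert(abd_u8_model((uint8_t)a, (uint8_t)b) ==
                   abd_u8_maxmin((uint8_t)a, (uint8_t)b));
            ++checked;
        }
    }
    return checked;
}
```

Calling verify_abd_model() from a test walks all 65536 pairs. This only checks the arithmetic identity per lane; it says nothing about performance, which is what the rest of this answer is about.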

On Haswell, pmin/pmax and psub only run on port 1 or port 5, but por can run on any of the three vector execution ports (0, 1, 5).

Skylake adds a 3rd vector-integer adder so there's no difference on that uarch. (See http://agner.org/optimize/ and other links in the x86 tag wiki, including Intel's optimization manual.)

This is also slightly better on Ryzen, where VPOR can run on any of P0123, but PADD/PMIN can only run on P013 according to Agner Fog's testing. (Ryzen splits 256b vector ops into 2 uops, but it has the throughput for that to be useful. It can't fill its 6-uop wide pipe using only single-uop instructions.)

Uops that can run on more ports are less likely to be delayed waiting for their assigned port (resource conflict), so you're more likely to actually get 2 cycle total latency with this (from both inputs being ready to the output being ready). They're also less likely to contribute to a throughput bottleneck if there's competition for a specific port (like port 5 which has the only shuffle unit on Intel Haswell and later).

Peter Cordes answered Oct 29 '25 18:10


