
AVX2 float compare and get 0.0 or 1.0 instead of all-0 or all-one bits

Tags: c++, avx, simd, sse, avx2

Basically, in the resulting vector, I want to store 1.0 for every input floating point value > 1, and 0.0 for every input floating point value <= 1. Here is my code:

float f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 0.7};
float r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}

__m256i tmp1 = _mm256_cvttps_epi32(_mm256_loadu_ps(f));
__m256i tmp2 = _mm256_cmpgt_epi32(tmp1, _mm256_set1_epi32(1));
_mm256_store_ps(r, _mm256_cvtepi32_ps(tmp2));

for(int i = 0; i < 8; i++)
    std::cout << f[i] << " : " << r[i] << std::endl;

But I don't get the correct results. This is what I get. Why aren't AVX2 relational operations working properly for me?

1.2 : 0
0.5 : 0
1.7 : 0
1.9 : 0
0.34 : 0
22.9 : -1
18.6 : -1
0.7 : 0
asked Apr 29 '17 19:04 by pythonic


2 Answers

I think it's better to use _mm256_cmp_ps for this. The following program does a bit more than you asked for: mask holds both the threshold the inputs are compared against and the value stored for the elements that pass. To store ones, set all mask elements to 1.0; to compare against and store another number, change the mask value to whatever you want.

//gcc 6.2, Linux-mint, Skylake 
#include <stdio.h>
#include <x86intrin.h>

float __attribute__(( aligned(32))) f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 1.0};
// float __attribute__(( aligned(32))) r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}
// in C++11, use alignas(32).  Or C11 _Alignas(32), instead of GNU C __attribute__.

void printVecps(__m256 vec)
{
    float tempps[8];                     // a plain local array is not guaranteed to be 32-byte aligned
    _mm256_storeu_ps(&tempps[0], vec);   // so use the unaligned store
    printf(" [0]=%3.2f, [1]=%3.2f, [2]=%3.2f, [3]=%3.2f, [4]=%3.2f, [5]=%3.2f, [6]=%3.2f, [7]=%3.2f \n",
    tempps[0],tempps[1],tempps[2],tempps[3],tempps[4],tempps[5],tempps[6],tempps[7]) ;

}

int main()
{

    __m256 mask = _mm256_set1_ps(1.0), vec1, vec2, vec3;

    vec1 = _mm256_load_ps(&f[0]);                   printf("vec1 : ");printVecps(vec1); // load f[0]-f[7] (f is 32-byte aligned)
    vec2 = _mm256_cmp_ps ( mask, vec1, _CMP_LT_OS /*0x1*/);
                                                    printf("vec2 : ");printVecps(vec2); // mask < vec1: all-ones bits (printed as -nan) where true, 0.0 where false
    vec3 = _mm256_min_ps (vec2 , mask);             printf("vec3 : ");printVecps(vec3); // min replaces the NaN lanes with the mask value; operand order matters

    return 0;
}

The output for mask = {1,1,1,1,1,1,1,1} is :

vec1 :  [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00 
vec2 :  [0]=-nan, [1]=0.00, [2]=-nan, [3]=-nan, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00 
vec3 :  [0]=1.00, [1]=0.00, [2]=1.00, [3]=1.00, [4]=0.00, [5]=1.00, [6]=1.00, [7]=0.00 

And for mask = {2,2,2,2,2,2,2,2} is :

vec1 :  [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00 
vec2 :  [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00 
vec3 :  [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=2.00, [6]=2.00, [7]=0.00 

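This depends on the non-commutative behaviour of _mm256_min_ps with NaNs to replace the NaN elements with 1.0. An all-ones compare result is a NaN bit pattern when viewed as a float, and min(a, b) is computed per element as a < b ? a : b. So NaN < 1.0 ? NaN : 1.0 yields 1.0, because any ordered comparison against NaN is false and the second operand is returned.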

Beware that gcc before 7.0 treats the 128b _mm_min_ps intrinsic as commutative even without -ffast-math (even though it knows the minps instruction isn't). Use an up-to-date gcc, or make sure that gcc chooses to compile your code with the operands in the order needed by this algorithm. (Or use clang). It's possible that gcc won't ever swap the operands with AVX, only with SSE (to avoid extra movapd instructions), but the safest thing is to use gcc7 or later.

answered Nov 10 '22 13:11 by Hossein Amiri


_mm256_cvttps_epi32 converts each float to an int by truncation (rounding toward zero). That means 1.2, 1.7, and 1.9 all become 1, and 1 is not greater than 1.

The output of _mm256_cmpgt_epi32 is not 1 but "all ones", from the docs:

... if the s1 data element is greater than the corresponding element in s2, then the corresponding element in the destination vector is set to all 1s.

Interpreted as a two's-complement integer, "all ones" is -1, which is what your results show: _mm256_cvtepi32_ps converts that -1 to -1.0.
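The two effects can be reproduced one lane at a time in plain scalar code. This is only a sketch: cvttps_lane, cmpgt_lane and question_lane are illustrative names, not real intrinsics, and the model assumes inputs that fit in a 32-bit int.

```cpp
#include <cstdint>

// Hypothetical scalar model of one lane of _mm256_cvttps_epi32:
// float -> int32 conversion truncates toward zero.
inline std::int32_t cvttps_lane(float x) {
    return static_cast<std::int32_t>(x);
}

// Hypothetical scalar model of one lane of _mm256_cmpgt_epi32:
// "true" is all bits set, which is -1 as a two's-complement integer.
inline std::int32_t cmpgt_lane(std::int32_t a, std::int32_t b) {
    return a > b ? -1 : 0;
}

// One lane of the question's pipeline: truncate, compare with 1,
// then convert the integer mask back to float (as _mm256_cvtepi32_ps does).
inline float question_lane(float x) {
    return static_cast<float>(cmpgt_lane(cvttps_lane(x), 1));
}
```

Here question_lane(1.2f) is 0.0f because 1.2 truncates to 1, and question_lane(22.9f) is -1.0f because the all-ones mask is converted as the integer -1.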

Off topic:

  • Why do you use an unaligned load and an aligned store?
  • You should take a look at _mm256_cmp_ps
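For reference, one common way to turn a _mm256_cmp_ps result into 0.0 / 1.0 is to AND the compare mask with a vector of 1.0. This is a different technique from the min-based one in the other answer, shown only as a sketch. It assumes an x86 CPU with AVX and GCC or Clang (the target attribute is a GNU extension that lets the function compile without -mavx), and greater_than_one is a made-up function name.

```cpp
#include <immintrin.h>

// Sketch: writes 1.0f to out[i] where in[i] > 1.0f and 0.0f elsewhere, for 8 floats.
// Assumption: GCC/Clang on x86 with AVX available at run time.
__attribute__((target("avx")))
static void greater_than_one(const float* in, float* out) {
    const __m256 one  = _mm256_set1_ps(1.0f);
    const __m256 v    = _mm256_loadu_ps(in);               // unaligned load: no alignment requirement on in
    const __m256 mask = _mm256_cmp_ps(v, one, _CMP_GT_OQ); // all-ones bits where v > 1.0, else all-zero bits
    // all-ones AND bits(1.0f) = bits(1.0f); all-zero AND anything = 0.0f
    _mm256_storeu_ps(out, _mm256_and_ps(mask, one));       // unaligned store: no alignment requirement on out
}
```

Called on the question's array f, this fills r with {1, 0, 1, 1, 0, 1, 1, 0}. Because _CMP_GT_OQ is an ordered predicate, a NaN input produces 0.0, and using the unaligned load and store on both sides sidesteps the alignment mismatch noted above.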
answered Nov 10 '22 12:11 by Jonas