Basically, in the resulting vector I want to store 1.0 for every input floating-point value > 1 and 0.0 for every input value <= 1. Here is my code:
float f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 0.7};
float r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}
__m256i tmp1 = _mm256_cvttps_epi32(_mm256_loadu_ps(f));
__m256i tmp2 = _mm256_cmpgt_epi32(tmp1, _mm256_set1_epi32(1));
_mm256_store_ps(r, _mm256_cvtepi32_ps(tmp2));
for(int i = 0; i < 8; i++)
std::cout << f[i] << " : " << r[i] << std::endl;
But I don't get the correct results. This is what I get. Why aren't AVX2 relational operations working properly for me?
1.2 : 0
0.5 : 0
1.7 : 0
1.9 : 0
0.34 : 0
22.9 : -1
18.6 : -1
0.7 : 0
I think it's better to use _mm256_cmp_ps for your question. I have implemented the following program for this purpose. It does a little more than you asked for: if you want to store ones, set all mask elements to 1, but if you want to store another number, change the mask value to whatever you want.
//gcc 6.2, Linux-mint, Skylake
#include <stdio.h>
#include <x86intrin.h>
float __attribute__(( aligned(32))) f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 1.0};
// float __attribute__(( aligned(32))) r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}
// in C++11, use alignas(32). Or C11 _Alignas(32), instead of GNU C __attribute__.
void printVecps(__m256 vec)
{
float tempps[8];
_mm256_storeu_ps(&tempps[0], vec); // tempps is not guaranteed to be 32-byte aligned, so use the unaligned store
printf(" [0]=%3.2f, [1]=%3.2f, [2]=%3.2f, [3]=%3.2f, [4]=%3.2f, [5]=%3.2f, [6]=%3.2f, [7]=%3.2f \n",
tempps[0],tempps[1],tempps[2],tempps[3],tempps[4],tempps[5],tempps[6],tempps[7]) ;
}
int main()
{
__m256 mask = _mm256_set1_ps(1.0), vec1, vec2, vec3;
vec1 = _mm256_load_ps(&f[0]); printf("vec1 : ");printVecps(vec1); // load vector values from f[0]-f[7]
vec2 = _mm256_cmp_ps ( mask, vec1, _CMP_LT_OS /*0x1*/);
printf("vec2 : ");printVecps(vec2); // compare them to mask (less)
vec3 = _mm256_min_ps (vec2 , mask); printf("vec3 : ");printVecps(vec3); // select minimum from mask and compared results
return 0;
}
The output for mask = {1,1,1,1,1,1,1,1} is:
vec1 : [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00
vec2 : [0]=-nan, [1]=0.00, [2]=-nan, [3]=-nan, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00
vec3 : [0]=1.00, [1]=0.00, [2]=1.00, [3]=1.00, [4]=0.00, [5]=1.00, [6]=1.00, [7]=0.00
And for mask = {2,2,2,2,2,2,2,2} it is:
vec1 : [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00
vec2 : [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00
vec3 : [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=2.00, [6]=2.00, [7]=0.00
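This depends on the non-commutative behaviour of _mm256_min_ps with NaNs to replace the NaN elements with 1.0. Each lane of min(a, b) is computed as a < b ? a : b, so NaN < 1.0 ? NaN : 1.0 = 1.0, because any ordered comparison involving NaN is false. In the lanes where the compare produced 0.0, 0.0 < 1.0 is true and 0.0 is kept.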
Beware that gcc before 7.0 treats the 128b _mm_min_ps intrinsic as commutative even without -ffast-math (even though it knows the minps instruction isn't). Use an up-to-date gcc, or make sure that gcc chooses to compile your code with the operands in the order needed by this algorithm. (Or use clang.) It's possible that gcc won't ever swap the operands with AVX, only with SSE (to avoid extra movapd instructions), but the safest thing is to use gcc7 or later.
When a float is converted to int using _mm256_cvttps_epi32, the integer returned is a truncated (rounded toward zero) value. That is, the values 1.2, 1.7, and 1.9 are all converted to 1, and thus they are not greater than 1.
The output of _mm256_cmpgt_epi32 is not 1 but "all ones". From the docs:
... if the s1 data element is greater than the corresponding element in s2, then the corresponding element in the destination vector is set to all 1s.
"All ones", read as a two's-complement integer, is -1, and _mm256_cvtepi32_ps then converts that to -1.0, which is what your results show.
Off topic:
_mm256_cmp_ps