An algorithm that I am working with spends a huge portion of the time comparing one array with a row of a matrix. If any ith element is the same, the algorithm calls a procedure A, if no elements are equal, procedure B is invoked instead. For example:
[1, 4, 10, 3, 5]
and [5, 3, 0, 3, 0]
calls A()
because for the 4th position, the value is 3 in both arrays.
[1, 4, 10, 3, 5]
and [5, 3, 0, 1, 0]
calls B()
because for the same position, the values are never the same.
Note that (1) the arrays and matrix rows have always the same size N, and (2) the algorithm calls A()
when at least one value matches.
The simplest but very naïve way of doing this in C is with:
for(int i=0; i<N; i++)
if( A[i] == B[i] ){
flag = 1;
break;
}
This is still very inefficient. On the worst case, I'll have N comparisons. The real problem here is that the algorithm does trillions of these comparisons.
N (the size of the array/row in the matrix) varies from 100 to 1000. I'd like to speedup this routine. I looked at vectorization and I found that I can use cmpeq_pd
. However, vectorization will still be limited because all my entries are longs
. Is there anybody with an idea? Can I apply masks and such, maybe?
More information/context:
I don't know whether this is applicable for the app you are developing, but operations on huge arrays are usually very well accelerated on GPU. You can expect 10-20x throughput speedup over CPU. If it's possible for your app to run the critical portion on CUDA that would make a huge difference.
Although your Sandy Bridge CPU only has AVX for 256 bit SIMD (and not AVX2), and therefore lacks supports for 4 way SIMD 64 bit integer operations, I think you can still achieve 4 way SIMD using AVX floating point instructions, as follows: to compare 2 x 256 bit vectors of 64 bit integer values, v1
, v2
:
__m256d vcmp = _mm256_xor_pd(v1, v2); // use XOR rather than compare, so we are not
// affected by values which map to NaNs
vcmp = _mm256_cmp_pd(vcmp, _mm256_setzero_pd(), _CMP_EQ_OQ);
// now we can do a valid comparison as if the
// data really is double precision float
int mask = _mm256_movemask_pd(vcmp); // extract the sign bits
bool any_eq = (mask != 0); // if any elements matched then mask
// will be non-zero
Here is an example program for test and illustration purposes:
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>
int test(__m256d v1, __m256d v2)
{
__m256d vcmp = _mm256_xor_pd(v1, v2);
vcmp = _mm256_cmp_pd(vcmp, _mm256_setzero_pd(), _CMP_EQ_OQ);
return _mm256_movemask_pd(vcmp);
}
int main()
{
int64_t a1[4] = { 3098, 3860, 405, 3308 };
int64_t a2[4] = { 1930, 1274, 2195, 2939 };
int64_t a3[4] = { 1930, 1274, 405, 2939 };
__m256i v1 = _mm256_loadu_pd((double *)a1);
__m256i v2 = _mm256_loadu_pd((double *)a2);
__m256i v3 = _mm256_loadu_pd((double *)a3);
printf("mask = %d (should be == 0)\n", test(v1, v2));
printf("mask = %d (should be != 0)\n", test(v1, v3));
return 0;
}
Test:
$ gcc -Wall -mavx a3mlord2.c && ./a.out
mask = 0 (should be == 0)
mask = 4 (should be != 0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With