Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparison of arrays (element by element)

An algorithm that I am working with spends a huge portion of the time comparing one array with a row of a matrix. If any ith element is the same, the algorithm calls a procedure A, if no elements are equal, procedure B is invoked instead. For example:

[1, 4, 10, 3, 5] and [5, 3, 0, 3, 0] calls A() because for the 4th position, the value is 3 in both arrays.

[1, 4, 10, 3, 5] and [5, 3, 0, 1, 0] calls B() because for the same position, the values are never the same.

Note that (1) the arrays and matrix rows have always the same size N, and (2) the algorithm calls A() when at least one value matches.

The simplest but very naïve way of doing this in C is with:

for(int i=0; i<N; i++)
   if( A[i] == B[i] ){
      flag = 1;
      break;
   }

This is still very inefficient. On the worst case, I'll have N comparisons. The real problem here is that the algorithm does trillions of these comparisons.

N (the size of the array/row in the matrix) varies from 100 to 1000. I'd like to speedup this routine. I looked at vectorization and I found that I can use cmpeq_pd. However, vectorization will still be limited because all my entries are longs. Is there anybody with an idea? Can I apply masks and such, maybe?

More information/context:

  • This is an iterative algorithm. At every iteration, I increment the matrix in one row and check the whole matrix several times. I might update a couple rows as well.
  • The likelihood for a match does not depend on the position.
  • I am willing to have false positives and negatives, in order to speedup this routine considerably.
  • If there is a match, the position in which a match is verified is not relevant (I just need to know if there is a matching position).
  • The biggest number (about 70%) of comparisons does not result in a match.
  • Parallelization is done at a different level, i.e. this kernel cannot be parallelized.
like image 967
a3mlord Avatar asked May 05 '15 13:05

a3mlord


2 Answers

I don't know whether this is applicable for the app you are developing, but operations on huge arrays are usually very well accelerated on GPU. You can expect 10-20x throughput speedup over CPU. If it's possible for your app to run the critical portion on CUDA that would make a huge difference.

like image 106
Michael Avatar answered Sep 28 '22 14:09

Michael


Although your Sandy Bridge CPU only has AVX for 256 bit SIMD (and not AVX2), and therefore lacks supports for 4 way SIMD 64 bit integer operations, I think you can still achieve 4 way SIMD using AVX floating point instructions, as follows: to compare 2 x 256 bit vectors of 64 bit integer values, v1, v2:

__m256d vcmp = _mm256_xor_pd(v1, v2); // use XOR rather than compare, so we are not 
                                      // affected by values which map to NaNs
vcmp = _mm256_cmp_pd(vcmp, _mm256_setzero_pd(), _CMP_EQ_OQ);
                                      // now we can do a valid comparison as if the
                                      // data really is double precision float
int mask = _mm256_movemask_pd(vcmp);  // extract the sign bits
bool any_eq = (mask != 0);            // if any elements matched then mask
                                      // will be non-zero

Here is an example program for test and illustration purposes:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int test(__m256d v1, __m256d v2)
{
    __m256d vcmp = _mm256_xor_pd(v1, v2);
    vcmp = _mm256_cmp_pd(vcmp, _mm256_setzero_pd(), _CMP_EQ_OQ);
    return _mm256_movemask_pd(vcmp);
}

int main()
{
    int64_t a1[4] = { 3098, 3860, 405, 3308 };
    int64_t a2[4] = { 1930, 1274, 2195, 2939 };
    int64_t a3[4] = { 1930, 1274, 405, 2939 };

    __m256i v1 = _mm256_loadu_pd((double *)a1);
    __m256i v2 = _mm256_loadu_pd((double *)a2);
    __m256i v3 = _mm256_loadu_pd((double *)a3);

    printf("mask = %d (should be == 0)\n", test(v1, v2));

    printf("mask = %d (should be != 0)\n", test(v1, v3));

    return 0;
}

Test:

$ gcc -Wall -mavx a3mlord2.c && ./a.out 
mask = 0 (should be == 0)
mask = 4 (should be != 0)
like image 31
Paul R Avatar answered Sep 28 '22 15:09

Paul R