
Fastest way to compare one byte array with many others?

I have a loop with the following structure:

  • Calculate a byte array with length k (somewhere slow)
  • Find if the calculated byte array matches any in a list of N byte arrays I have.
  • Repeat

This loop runs many, many times (it's the main loop of my program), and I want the second step to be as fast as possible.

The naive implementation of the second step would use memcmp:

#include <stdio.h>
#include <string.h>

char* calc;    /* the freshly calculated array, length k */
char** list;   /* the n precomputed arrays, each of length k */
int k, n, i;
for (i = 0; i < n; i++) {
  if (!memcmp(calc, list[i], k)) {
    printf("Matches array %d\n", i);
  }
}

Can you think of any faster way? A few notes:

  • My list is fixed at the start of my program, so any precomputation on it is fine.
  • Let's assume that k is small (<= 64) and N is moderate (around 100-1000).
  • Performance is the goal here, and portability is a non issue. Intrinsics/inline assembly is fine, as long as it's faster.

Here are a few thoughts that I had :

  • Given k < 64 and that I'm on x86_64, I could sort my lookup list as an array of longs and do a binary search on it: O(log n). A sketch of this follows the list. Even if k were big, I could still sort the lookup list and do the binary search using memcmp.
  • Again, given that k is small, I could compute an 8/16/32-bit checksum of each lookup array (the simplest being to fold the array over itself with XOR) and compare many checksums at once with PCMPGT, as in How to compare more than two numbers in parallel?. I know SSE4.2 is available here.
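
Here is a minimal sketch of the binary-search idea, assuming each array fits in 64 bits (k <= 8 bytes), as the follow-up comment below suggests; load_key, cmp_u64 and key_exists are illustrative names, not from the question:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Pack a k-byte array (k <= 8) into a uint64_t, zero-padded. */
static uint64_t load_key(const char *p, int k) {
  uint64_t v = 0;
  memcpy(&v, p, (size_t)k);
  return v;
}

static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
  return (x > y) - (x < y);   /* avoids overflow of x - y */
}

/* Precompute once at startup: keys[i] = load_key(list[i], k),
 * then qsort(keys, n, sizeof(uint64_t), cmp_u64). */

/* Per iteration: one O(log n) bsearch instead of n memcmp calls. */
static int key_exists(const uint64_t *keys, size_t n,
                      const char *calc, int k) {
  uint64_t key = load_key(calc, k);
  return bsearch(&key, keys, n, sizeof key, cmp_u64) != NULL;
}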

Do you think vectorization/SSE is a good idea here? If so, what do you think is the best approach? I'd like to stress that this isn't premature optimization: performance is crucial, and I need the outer loop to be as fast as possible. Thanks

EDIT1: It looks like http://schani.wordpress.com/tag/c-optimization-linear-binary-search-sse2-simd/ provides some interesting thoughts on this. Binary search on a list of longs seems the way to go.

Asked Jan 17 '14 by Wam


2 Answers

The optimum solution is going to depend on how many arrays there are to match against, the size of the arrays, and how often they change. I would look at avoiding the comparisons altogether.

Assuming the list of arrays to compare against does not change frequently and you have many such arrays, I would create a hash of each array; then, when you come to compare, hash the thing you are testing and compare only the hash values. With a hash like SHA-256 you can rely on this as both a positive and a negative indicator (i.e. matching hashes are sufficient to say the arrays match, and differing hashes are sufficient to say they differ). This would work very well if you had, say, 1,000,000 arrays to compare against which hardly ever change, since calculating one hash would be faster than 1,000,000 array comparisons.
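
As a rough sketch of that idea, assuming OpenSSL is available for the hashing (the table layout and all names here are my own, not from the answer):

#include <string.h>
#include <openssl/sha.h>   /* link with -lcrypto */

#define N_MAX 1000

/* Filled once at startup: digests[i] holds SHA-256 of list[i]. */
static unsigned char digests[N_MAX][SHA256_DIGEST_LENGTH];

/* Returns the index of the matching array, or -1. Only 32-byte
 * digest comparisons are done, never the arrays themselves. */
static int find_by_digest(const char *calc, int k, int n) {
  unsigned char d[SHA256_DIGEST_LENGTH];
  SHA256((const unsigned char *)calc, (size_t)k, d);
  for (int i = 0; i < n; i++)
    if (!memcmp(d, digests[i], SHA256_DIGEST_LENGTH))
      return i;
  return -1;
}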

If your number of arrays is a bit smaller, you might consider a faster non-cryptographic hash. For instance, a 'hash' which simply sums the bytes in an array modulo 256 (this is a terrible hash and you can do much better) would eliminate the need to compare, say, 255/256ths of the target arrays; you would then compare only those where the so-called 'hash' matches. There are well-known hash-like functions such as CRC-32 which are quick to calculate.
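
As a toy illustration of that byte-sum filter (deliberately weak, as the answer notes; the function name is mine):

/* Sum all bytes modulo 256 - a crude filter: arrays whose sums
 * differ cannot be equal, so only same-sum arrays need memcmp. */
static unsigned char byte_sum_hash(const char *p, int k) {
  unsigned sum = 0;
  for (int i = 0; i < k; i++)
    sum += (unsigned char)p[i];
  return (unsigned char)sum;   /* implicit modulo 256 */
}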

In either case you can then do a lookup by hash (modulo X) to determine which arrays actually need comparing.

You suggest k is small and N is moderate (i.e. about 1000). I'm guessing speed will revolve around the memory cache, so not touching 1,000 small arrays on every iteration is going to be pretty helpful.

All of the above will be useless if the arrays change about as often as they are compared.

Addendum (assuming you are looking at 64 bytes or similar): I'd look into a very fast non-cryptographic hash function, for instance MurmurHash3: https://code.google.com/p/smhasher/wiki/MurmurHash3

It looks like 3-4 instructions per 32-bit word to generate the hash. You could then truncate the result to, say, 12 bits for a 4096-entry hash table with very few collisions (each bucket being a linked list to the target arrays). This means you would spend roughly 30 instructions calculating the hash, then one instruction per bucket entry (expected value: 1) to find the list item, then one actual memcmp per expected hit (expected value: between 0 and 1). So rather than comparing 1,000 arrays, you compare between 0 and 1 arrays and generate one hash. If you can't compare 999 arrays in about 30 instructions (I'm guessing not!), this is obviously a win.
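
Below is a rough sketch of that scheme under the stated assumptions. All names are mine, and hash_bytes() uses FNV-1a purely so the sketch is self-contained; in the scheme described you would substitute MurmurHash3:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)   /* 4096 buckets, as suggested */

struct entry {
  const char   *data;   /* one of the N precomputed arrays */
  struct entry *next;   /* collision chain */
};

static struct entry *table[TABLE_SIZE];

/* Stand-in hash (FNV-1a); swap in MurmurHash3 for speed. */
static uint32_t hash_bytes(const char *p, int k) {
  uint32_t h = 2166136261u;
  for (int i = 0; i < k; i++) {
    h ^= (unsigned char)p[i];
    h *= 16777619u;
  }
  return h;
}

/* Precomputation: insert each of the N arrays once. */
static void insert(const char *arr, int k) {
  uint32_t h = hash_bytes(arr, k) & (TABLE_SIZE - 1);
  struct entry *e = malloc(sizeof *e);
  e->data = arr;
  e->next = table[h];
  table[h] = e;
}

/* Per iteration: one hash, ~1 chain step, and 0-1 memcmp calls. */
static const char *find(const char *calc, int k) {
  uint32_t h = hash_bytes(calc, k) & (TABLE_SIZE - 1);
  for (struct entry *e = table[h]; e; e = e->next)
    if (!memcmp(e->data, calc, k))
      return e->data;
  return NULL;
}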

Answered Sep 28 '22 by abligh


"We can assume that my stuff fits in 64 bits, or even 32 bits. If it didn't, I could hash it so that it did. But now, what's the fastest way to find out whether my hash exists in the list of precomputed hashes?"

This is sort of a meta-answer, but if your question boils down to "how can I efficiently find whether a certain 32-bit number exists in a list of other 32-bit numbers?", then this is a problem IP routers deal with all the time, so it might be worth looking into the networking literature to see if there's something you can adapt from their algorithms. e.g. see http://cit.mak.ac.ug/iccir/downloads/SREC_07/K.J.Poornaselvan1,S.Suresh,%20C.Divya%20Preya%20and%20C.G.Gayathri_07.pdf

(Although I suspect their algorithms are optimized for searching through far more items than in your use case.)

Answered Sep 28 '22 by JVMATL