I have 2 arrays of 16 elements (chars) that I need to "compare" and see how many elements are equal between the two.
This routine is going to be used millions of times (a usual run is about 60 or 70 million times), so I need it to be as fast as possible. I'm working on C++ (C++Builder 2007, for the record)
Right now, I have a simple:
matches += array1[0] == array2[0];
repeated 16 times (as profiling it appears to be 30% faster than doing it with a for loop)
Is there any other way that could work faster?
Some data about the environment and the data itself:
UPDATE: This answer has been modified to make my comments match the source code provided below.
There is an optimization available if you have the capability to use SSE2 and popcnt instructions.
16 bytes happens to fit nicely in an SSE register. Using c++ and assembly/intrinsics, load the two 16 byte arrays into xmm registers, and cmp them. This generates a bitmask representing the true/false condition of the compare. You then use a movmsk instruction to load a bit representation of the bitmask into an x86 register; this then becomes a bit field where you can count all the 1's to determine how many true values you had. A hardware popcnt instruction can be a fast way to count all the 1's in a register.
This requires knowledge of assembly/intrinsics and SSE in particular. You should be able to find web resources for both.
If you run this code on a machine that does not support either SSE2 or popcnt, you must then iterate through the arrays and count the differences with your unrolled loop approach.
Good luck
Edit: Since you indicated you did not know assembly, here's some sample code to illustrate my answer:
#include "stdafx.h"
#include <iostream>
#include "intrin.h"
inline unsigned cmpArray16( char (&arr1)[16], char (&arr2)[16] )
{
__m128i first = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr1 ) );
__m128i second = _mm_loadu_si128( reinterpret_cast<__m128i*>( &arr2 ) );
return _mm_movemask_epi8( _mm_cmpeq_epi8( first, second ) );
}
int _tmain( int argc, _TCHAR* argv[] )
{
unsigned count = 0;
char arr1[16] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0 };
char arr2[16] = { 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0 };
count = __popcnt( cmpArray16( arr1, arr2 ) );
std::cout << "The number of equivalent bytes = " << count << std::endl;
return 0;
}
Some notes: This function uses SSE2 instructions and a popcnt instruction introduced in the Phenom processor (that's the machine that I use). I believe the most recent Intel processors with SSE4 also have popcnt. This function does not check for instruction support with CPUID; the function is undefined if used on a processor that does not have SSE2 or popcnt (you will probably get an invalid opcode instruction). That detection code is a separate thread.
I have not timed this code; the reason I think it's faster is because it compares 16 bytes at a time, branchless. You should modify this to fit your environment, and time it yourself to see if it works for you. I wrote and tested this on VS2008 SP1.
SSE prefers data that is aligned on a natural 16-byte boundary; if you can guarantee that then you should get additional speed improvements, and you can change the _mm_loadu_si128 instructions to _mm_load_si128, which requires alignment.
The key is to do the comparisons using the largest register your CPU supports, then fallback to bytes if necessary.
The below code demonstrates with using 4-byte integers, but if you are running on a SIMD architecture (any modern Intel or AMD chip) you could compare both arrays in one instruction before falling back to an integer-based loop. Most compilers these days have intrinsic support for 128-bit types so will NOT require ASM.
(Note that for the SIMD comparisions your arrays would have to be 16-byte aligned, and some processors (e.g MIPS) would require the arrays to be 4-byte aligned for the int-based comparisons.
E.g.
int* array1 = (int*)byteArray[0];
int* array2 = (int*)byteArray[1];
int same = 0;
for (int i = 0; i < 4; i++)
{
// test as an int
if (array1[i] == array2[i])
{
same += 4;
}
else
{
// test individual bytes
char* bytes1 = (char*)(array1+i);
char* bytes2 = (char*)(array2+i);
for (int j = 0; j < 4; j++)
{
same += (bytes1[j] == bytes2[j];
}
}
}
I can't remember what exactly the MSVC compiler supports for SIMD, but you could do something like;
// depending on compiler you may have to insert the words via an intrinsic
__m128 qw1 = *(__m128*)byteArray[0];
__m128 qw2 = *(__m128*)byteArray[1];
// again, depending on the compiler the comparision may have to be done via an intrinsic
if (qw1 == qw2)
{
same = 16;
}
else
{
// do int/byte testing
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With