I have a huge vector<vector<int>>
(18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
for (int k = 0; k < 128; k += 2) {
pplus = nodeL[indx2][k] - nodeL[indx1][k];
pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
As you see, the function, loops through the two row vectors does some subtraction and at the end returns a maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is microoptimization but it would helpful if you could provide some help, since I know nothing about SSE. Thanks in advance
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
nodeIds[l] = randomNodeID(18000000);
goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}
/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>
int getDiff_SSE(int indx1, int indx2)
{
int result[4] __attribute__ ((aligned(16))) = { 0 };
const int * const p1 = &nodeL[indx1][0];
const int * const p2 = &nodeL[indx2][0];
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
__m128i vresult = _mm_set1_epi32(0);
for (int k = 0; k < 128; k += 4)
{
__m128i v1, v2, vmax;
v1 = _mm_loadu_si128((__m128i *)&p1[k]);
v2 = _mm_loadu_si128((__m128i *)&p2[k]);
v1 = _mm_xor_si128(v1, vke);
v2 = _mm_xor_si128(v2, vko);
v1 = _mm_sub_epi32(v1, vke);
v2 = _mm_sub_epi32(v2, vko);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *)result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason being is that there is a lot of memory access compared to computation. The CPU is much faster than the memory and a trivial implementation of the above will already have the CPU stalling when it's waiting for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benfit, and that's when you're doing more computation between memory reads - i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory constrained tasks that can run in prarallel so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
const vector<int>& nodetemp1 = nodeL[indx1];
const vector<int>& nodetemp2 = nodeL[indx2];
for (int k = 0; k < 128; k += 2) {
pplus = nodetemp2[k] - nodetemp1[k];
pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With