int *s;
allocate memory for s[100];
void func (int *a, int *b)
{
int i;
for (i = 0; i < 100; i++)
{
s[i] = a[i] ^ b[i];
}
}
Assume that this particular code snippet is being called 1000 times, and this is the most time consuming operation in my code. Also assume that addresses of a and b is changed every time. 's' is a global variable which is updated with different sets of values of a & b.
As far as I assume, the main performance bottleneck would be memory access, because the only other operation is XOR, which is very trivial.
Would you please suggest how can I optimize my code in the best possible way?
the question I really wanted to ask, but I think it didn't get properly conveyed is, let for example this for loop contains 10 such XOR operations, the loop count is 100 and the function is called 1000 times, the point is high memory access..If the code is to be executed on a single core machine, what are scopes for improvement?
I've tested proposed solutions, and other two. I was not able to test onemasse' proposal as the result saved to s[] was not correct. I was not able to fix it too. I had to do some changes on moonshadow code. The measurement unit is clock cycles, so lower is better.
Original code:
#define MAX 100
void inline STACKO ( struct timespec *ts, struct timespec *te ){
int i, *s, *a, *b;
for (i = 0; i < MAX; ++i){
s = (int *) malloc (sizeof (int)); ++s;
a = (int *) malloc (sizeof (int)); ++a;
b = (int *) malloc (sizeof (int)); ++b;
}
srand ( 1024 );
for (i = 0; i < MAX; ++i){
a[i] = ( rand() % 2 );
b[i] = ( rand() % 2 );
}
rdtscb_getticks ( ts ); /* start measurement */
for (i = 0; i < MAX; i++)
s[i] = a[i] ^ b[i];
rdtscb_getticks ( te ); /* end measurement */
/*
printf("\n");
for (i = 0; i < MAX; ++i)
printf("%d", s[i]);
printf("\n");
*/
}
New proposal 1: register int
From:
int i, *s, *a, *b;
To:
register int i, *s, *a, *b;
New proposal 2: No array notation
s_end = &s[MAX];
for (s_ptr = &s[0], a_ptr = &a[0], b_ptr = &b[0]; \
s_ptr < s_end; \
++s_ptr, ++a_ptr, ++b_ptr){
*s_ptr = *a_ptr ^ *b_ptr;
}
moonshadow proposed optimization:
s_ptr = &s[0];
a_ptr = &a[0];
b_ptr = &b[0];
for (i = 0; i < (MAX/4); i++){
s_ptr[0] = a_ptr[0] ^ b_ptr[0];
s_ptr[1] = a_ptr[1] ^ b_ptr[1];
s_ptr[2] = a_ptr[2] ^ b_ptr[2];
s_ptr[3] = a_ptr[3] ^ b_ptr[3];
s_ptr+=4; a_ptr+=4; b_ptr+=4;
}
moonshadow proposed optimization + register int:
From:
int i, *s, ...
To:
register int i, *s, ...
Christoffer proposed optimization:
#pragma omp for
for (i = 0; i < MAX; i++)
{
s[i] = a[i] ^ b[i];
}
Original Code 1036.727264
New Proposal 1 611.147928
New proposal 2 450.788845
moonshadow 713.3845
moonshadow2 452.481192
Christoffer 1054.321943
There is other simple way of optimizing the resulting binary. Passing -O2 to gcc tells that you want optimization. To know exactly what -O2 does, refer to gcc man page.
After enabling -O2:
Original Code 464.233031
New Proposal 1 452.620255
New proposal 2 454.519383
moonshadow 428.651083
moonshadow2 419.317444
Christoffer 452.079057
Source codes available at: http://goo.gl/ud52m
Don't use the loop variable to index. Unroll the loop.
for (i = 0; i < (100/4); i++)
{
s[0] = a[0] ^ b[0];
s[1] = a[1] ^ b[1];
s[2] = a[2] ^ b[2];
s[3] = a[3] ^ b[3];
s+=4; a+=4; b+=4;
}
Work out how to perform SIMD XOR on your platform.
Performing these XORs as an explicit step is potentially more expensive than doing them as part of another calculation: you're having to read from a and b and store the result in s - if s is read again for more calculation, you'd save a read and a write per iteration, and all the function call and loop overhead, by doing the XOR there instead; likewise, if a and b are outputs of some other functions, you do better by performing the XOR at the end of one of those functions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With