Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization Tips

Tags:

c

optimization

int *s;
allocate memory for s[100];
void func (int *a, int *b)
{
    int i;

    for (i = 0; i < 100; i++)
    {
        s[i] = a[i] ^ b[i];
    }
}

Assume that this particular code snippet is being called 1000 times, and this is the most time consuming operation in my code. Also assume that addresses of a and b is changed every time. 's' is a global variable which is updated with different sets of values of a & b.

As far as I assume, the main performance bottleneck would be memory access, because the only other operation is XOR, which is very trivial.

Would you please suggest how can I optimize my code in the best possible way?

the question I really wanted to ask, but I think it didn't get properly conveyed is, let for example this for loop contains 10 such XOR operations, the loop count is 100 and the function is called 1000 times, the point is high memory access..If the code is to be executed on a single core machine, what are scopes for improvement?

like image 775
anup Avatar asked Dec 09 '22 12:12

anup


2 Answers

I've tested proposed solutions, and other two. I was not able to test onemasse' proposal as the result saved to s[] was not correct. I was not able to fix it too. I had to do some changes on moonshadow code. The measurement unit is clock cycles, so lower is better.

Original code:

#define MAX 100
void inline STACKO ( struct timespec *ts, struct timespec *te ){

    int i, *s, *a, *b;

    for (i = 0; i < MAX; ++i){
        s = (int *) malloc (sizeof (int)); ++s;
        a = (int *) malloc (sizeof (int)); ++a;
        b = (int *) malloc (sizeof (int)); ++b;
    }

    srand ( 1024 );
    for (i = 0; i < MAX; ++i){
        a[i] = ( rand() % 2 );
        b[i] = ( rand() % 2 );
    }

    rdtscb_getticks ( ts ); /* start measurement */

    for (i = 0; i < MAX; i++)
        s[i] = a[i] ^ b[i];

    rdtscb_getticks ( te ); /* end measurement */

/*
    printf("\n");
    for (i = 0; i < MAX; ++i)
        printf("%d", s[i]);
    printf("\n");
*/
}

New proposal 1: register int

From:

int i, *s, *a, *b;

To:

register int i, *s, *a, *b;

New proposal 2: No array notation

s_end = &s[MAX];
for (s_ptr = &s[0], a_ptr = &a[0], b_ptr = &b[0]; \
   s_ptr < s_end; \
   ++s_ptr, ++a_ptr, ++b_ptr){
    *s_ptr = *a_ptr ^ *b_ptr;
}

moonshadow proposed optimization:

s_ptr = &s[0];
a_ptr = &a[0];
b_ptr = &b[0];

for (i = 0; i < (MAX/4); i++){

    s_ptr[0] = a_ptr[0] ^ b_ptr[0];
    s_ptr[1] = a_ptr[1] ^ b_ptr[1];
    s_ptr[2] = a_ptr[2] ^ b_ptr[2];
    s_ptr[3] = a_ptr[3] ^ b_ptr[3];
    s_ptr+=4; a_ptr+=4; b_ptr+=4;
}

moonshadow proposed optimization + register int:

From:

int i, *s, ...

To:

register int i, *s, ...

Christoffer proposed optimization:

#pragma omp for
for (i = 0; i < MAX; i++)
{
    s[i] = a[i] ^ b[i];
}

Results:

Performance graph

Original Code   1036.727264
New Proposal 1  611.147928
New proposal 2  450.788845
moonshadow      713.3845
moonshadow2     452.481192
Christoffer     1054.321943

There is other simple way of optimizing the resulting binary. Passing -O2 to gcc tells that you want optimization. To know exactly what -O2 does, refer to gcc man page.

After enabling -O2: Performance graph

Original Code   464.233031
New Proposal 1  452.620255
New proposal 2  454.519383
moonshadow      428.651083
moonshadow2     419.317444
Christoffer     452.079057

Source codes available at: http://goo.gl/ud52m

like image 119
Peter Senna Avatar answered Dec 30 '22 18:12

Peter Senna


Don't use the loop variable to index. Unroll the loop.

for (i = 0; i < (100/4); i++)
{
  s[0] = a[0] ^ b[0];
  s[1] = a[1] ^ b[1];
  s[2] = a[2] ^ b[2];
  s[3] = a[3] ^ b[3];
  s+=4; a+=4; b+=4;
}

Work out how to perform SIMD XOR on your platform.

Performing these XORs as an explicit step is potentially more expensive than doing them as part of another calculation: you're having to read from a and b and store the result in s - if s is read again for more calculation, you'd save a read and a write per iteration, and all the function call and loop overhead, by doing the XOR there instead; likewise, if a and b are outputs of some other functions, you do better by performing the XOR at the end of one of those functions.

like image 20
moonshadow Avatar answered Dec 30 '22 17:12

moonshadow