Optimize nested loops for pattern-filling an array, to help the compiler produce efficient ARM assembly?

Question

I have just been given an assignment to re-write the following C function, to help the ARM compiler produce more efficient assembly code. Does anyone know how to do this?

void some_function(int *data)
{
    int  i, j;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j++)
            data[j + 64*i] = (i + j)/2;
    }
}

Yunnosch · Accepted Answer

First (as Jonathan Leffler mentioned) the compiler is likely to do so good a job already that trying to optimise by writing specific C code is usually commercially questionable, i.e. you lose more money via development time than you can make by slightly faster code.
But sometimes it is worth it; let's assume it is the case here.

If you do optimise, do so while measuring. It is very possible to write code which ends up being less optimal, because in subtle ways otherwise possible compiler optimisations are foiled. Also, whether and how much optimisation works depends on the environment, i.e. measuring in all potential environments is necessary.

Ok, after that wise-cracking, here is code in which I demonstrate optimisations as proposed in comments, one of them by Jonathan Leffler:

/* Jonathan Leffler */
void some_function(int *data)
{
    int  i, j;
    int  k = 0;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j++)
        {
            data[k++] = (i + j)/2;
        }
    }
}

/* Yunnosch 1, loop unrolling by 2 */
void some_function(int *data)
{
    int  i, j;

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j+=2)
            data[j +     64*i] = (i + j  )/2;
            data[j + 1 + 64*i] = (i + j+1)/2;
    }
}

/* Yunnosch 1 and Jonathan Leffler */
void some_function(int *data)
{
    int  i, j;
    int k=0; /* Jonathan Leffler */

    for (i = 0; i < 64; i++)
    {
        for (j = 0; j < 64; j+=2) /* Yunnosch */
        {
            data[k++] = (i + j  )/2;
            data[k++] = (i + j+1)/2; /* Yunnosch */
        }
    }
}

/* Yunnosch 2, avoiding the /2, including Jonathan Leffler */
/* Well, duh. This is harder than I thought... 
   I admit that this is NOT tested, I want to demonstrate the idea.
   Everybody feel free to help the very grateful me with fixing errors. */
void some_function(int *data)
{
    int  i, j;
    int  k=0;

    for (i = 0; i < 32; i++) /* magic numbers I normally avoid, 32 is 64/2 */
    {
        for (j = 0; j < 32; j++)
        {
            data[k     ] = (i + j);
            data[k+1   ] = (i + j);
            data[k  +64] = (i + j);
            data[k+1+64] = (i + j +1);
            k+=2;
        }
        k+=64;
    }
}

The last version is based on the following observable 2x2 group pattern in the desired result, as seen in a 2D interpretation:

00 11 ...
01 12 ...

11 22 ...
12 23 ...
.. ..
.. ..
.. ..
´´´´

Support Ukraine · Answer

Optimizing C code to generate "more efficient assembly code" for a specific compiler/processor is something you normally shouldn't do. Write clear and easy to understand C code and let the compiler do the optimization.

Even if you make all kinds of tricks with the C code and end up with "more efficient assembly code" for your specific compiler/processor, it may turn out that a simple compiler upgrade may ruin the whole thing and you'll have to change the C code again.

For something as simple as your code, write it in assembler code from the start. But be aware that you'll have to be a real expert in that processor/assembly language to beat a decent compiler.

Anyway... If we want to guess, this is my guess:

void some_function(int *data)
{
    int  i, j, x;

    for (i = 0; i < 64; i++)
    {
        // Handle even i-values
        x = i/2;
        for (j = 0; j < 64; j += 2)
        {
            *data = x;
            ++data;
            *data = x;
            ++data;
            ++x;        // Increment after writing to data twice
        }

        ++i;

        // Handle odd i-values
        x = i/2;
        for (j = 0; j < 64; j += 2)
        {
            *data = x;
            ++data;
            ++x;        // Increment after writing to data once
            *data = x;
            ++data;
        }
    }
}

The idea is 1) to replace the array-indexing with pointer increments and 2) to replace the (i+j)/2 with integer increments.

I have not done any measurement so I can't say for sure that this will be a good solution. I'll leave that to OP.

Same idea as above, but with a few more tweaks (proposed by @user3386109)

void some_function(int *data)
{
    for (int i = 0; i < 32; i++)
    {
        // when i is even, the output is in matched pairs
        int value = i;
        for (int j = 0; j < 32; j++)
        {
            *data++ = value;
            *data++ = value++;
        }

        // when i is odd, the output starts with a singleton
        // followed by matched pairs, and ending with a singleton
        value = i;
        *data++ = value++;
        for (int j = 0; j < 31; j++)
        {
            *data++ = value;
            *data++ = value++;
        }
        *data++ = value;
    }
}

Optimize nested loops for pattern-filling an array, to help the compiler produce efficient ARM assembly?

Tags:

c

assembly

micro-optimization

arm

Lai Yao Hao

2 Answers

Yunnosch

Support Ukraine

Recent Activity

Donate For Us

Optimize nested loops for pattern-filling an array, to help the compiler produce efficient ARM assembly?

Tags:

c

assembly

micro-optimization

arm

Lai Yao Hao

2 Answers

Yunnosch

Support Ukraine

Related questions

Recent Activity

Donate For Us