
What does it mean for code to be "hand optimized"?

This might be a really noobie question and I'm not even sure if this is the right forum to ask it but please bear with me and give me a nudge in the right direction if it's not.

I've always heard this term thrown around and I'm still not quite certain that I know what it means. What does it mean for code to be hand optimized? I've searched online and I haven't been able to find a formal definition for it, stackexchange or otherwise.

For some context, take this excerpt from the Wikipedia article on Program Optimization for example:

At the lowest level, writing code using an assembly language, designed for a particular hardware platform can produce the most efficient and compact code if the programmer takes advantage of the full repertoire of machine instructions. Many operating systems used on embedded systems have been traditionally written in assembler code for this reason. Programs (other than very small programs) are seldom written from start to finish in assembly due to the time and cost involved. Most are compiled down from a high level language to assembly and hand optimized from there. When efficiency and size are less important large parts may be written in a high-level language.

Going by context, I've presumed that it means "editing machine code by hand to optimize an algorithm" or something along those lines. But I'm still rather confused as I've heard this term used in the context of non-assembly languages such as C++ and Java as well.

Skylar Kennedy asked Apr 04 '18


1 Answer

Compilers generally take a high-level language like C, C++, or Java and compile it down to an assembly language; behind the scenes they then usually call the assembler for you, and possibly the linker, so all you see is high-level source in and either an object file or a final binary out. Run gcc with -save-temps to see some of the intermediate files produced by the various programs gcc spawns on the way to the object or binary.

Compilers are written by humans and are generally good, but not perfect. Nothing is perfect anyway: my computer might have faster memory and a slower processor than yours, so the "perfectly optimized" output for the same source code might differ between my machine and yours. Even targeting the same platform, say an x86 Linux machine, there is no single perfect binary. At the same time, the compiler doesn't get tired: give it a big file or project, a complicated algorithm or a simple one, and it will reliably produce the assembly that gets assembled, and so on.

This is where hand optimization comes in, and you basically already quoted the answer to your question. There is no reason to mess with machine code: you grab the assembly language produced by the compiler, through one of the various ways the compiler can produce it and leave it for you (or you steal it, by renaming the assembler and putting your own program in its place; the compiler spawns your program thinking it is part of the toolchain, and you grab the file there). Then a human who has, or thinks they have, great skill does not have to do the entire job of writing the code for that task, but can examine the compiler output, find missed optimizations, or tune the code for their system, for whatever reason, to whatever definition of "better" they choose.

I got lucky with one of these in another question once, but take this typical optimization:

unsigned int fun ( unsigned int a )
{
    return(a/5);
}

    00000000 <fun>:
   0:   4b02        ldr r3, [pc, #8]    ; (c <fun+0xc>)
   2:   fba3 3000   umull   r3, r0, r3, r0
   6:   0880        lsrs    r0, r0, #2
   8:   4770        bx  lr
   a:   bf00        nop
   c:   cccccccd    

It is doing a multiplication by 1/5 instead of a divide by 5. Why? You are more likely to find a processor with a multiply than a divide; a multiply takes less logic than a divide and settles faster. Many processors will claim the multiply takes "one clock cycle", but that is like one car coming off the end of the factory line every minute: it doesn't mean it takes one minute to build a car.

A multiply, and sometimes a shift, against a constant is not atypical for a division where the divisor is known at compile time. A plain divide in this case would be a move-immediate and a divide, and maybe done: two instructions, no additional memory cycle. So if the divide and the move each take a clock, that should have been much faster than a load from, say, the flash of a microcontroller, which often runs at no better than half the clock rate of the CPU, if not with more wait states depending on the settings, something the compiler doesn't know. That load could be a killer; the extra instruction fetch could be a killer; I might happen to know that. At the same time, the IP vendor in this case might offer a core where the chip vendor can choose to compile the multiply into two or more clocks to significantly save on chip real estate, at the cost of a little performance for that one type of operation. There might not even be a compiler setting to indicate this, assuming the compiler could analyze that kind of thing anyway. This is not the kind of code you would hand optimize on its own, but you might see these lines in a larger function's output and choose to experiment.

Another might be a couple of loops:

void dummy ( unsigned int );
void fun ( unsigned int a, unsigned int b, unsigned int c )
{
    unsigned int ra;

    for(ra=0;ra<a;ra++) dummy(ra);
    for(ra=0;ra<b;ra++) dummy(ra);
}
00000000 <fun>:
   0:   e92d4070    push    {r4, r5, r6, lr}
   4:   e2506000    subs    r6, r0, #0
   8:   e1a05001    mov r5, r1
   c:   0a000005    beq 28 <fun+0x28>
  10:   e3a04000    mov r4, #0
  14:   e1a00004    mov r0, r4
  18:   e2844001    add r4, r4, #1
  1c:   ebfffffe    bl  0 <dummy>
  20:   e1560004    cmp r6, r4
  24:   1afffffa    bne 14 <fun+0x14>
  28:   e3550000    cmp r5, #0
  2c:   0a000005    beq 48 <fun+0x48>
  30:   e3a04000    mov r4, #0
  34:   e1a00004    mov r0, r4
  38:   e2844001    add r4, r4, #1
  3c:   ebfffffe    bl  0 <dummy>
  40:   e1550004    cmp r5, r4
  44:   1afffffa    bne 34 <fun+0x34>
  48:   e8bd4070    pop {r4, r5, r6, lr}
  4c:   e12fff1e    bx  lr

and that was the linked output, and I happened to know that this core had an 8-word aligned (and sized) fetch. Those loops really want to move down so that each iteration takes only one fetch rather than two. So I could take the assembly output and add nops somewhere at the beginning of the function, before the loops, to move their alignment. Now this is tedious: add any code to the project and the alignment can change, forcing you to re-tune, and this adjustment can/will cause anything further down in the address space to move and need re-tuning as well. But it is an example of having some knowledge that is (or is felt to be) important leading to messing with the compiler output by hand. There are easier ways to tune some loops like this without the pain of having to re-touch the output every time you change the toolchain or the code.

"Most are compiled down from a high level language to assembly and hand optimized from there."

The answer was in your question: the rest of that quote sets up a situation where the author is discouraged from writing the entire project and/or function in assembly language, and instead has the compiler do the grunt work while the human does some hand optimization that they felt was important or required for some reason.

EDIT: okay, here is one to ponder...

unsigned int fun ( unsigned int x )
{
    return(x/5);
}

armv7-m

00000000 <fun>:
   0:   4b02        ldr r3, [pc, #8]    ; (c <fun+0xc>)
   2:   fba3 3000   umull   r3, r0, r3, r0
   6:   0880        lsrs    r0, r0, #2
   8:   4770        bx  lr
   a:   bf00        nop
   c:   cccccccd    stclgt  12, cr12, [r12], {205}  ; 0xcd

armv6-m (all thumb variants have mul; not all have umull)

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   2105        movs    r1, #5
   4:   f7ff fffe   bl  0 <__aeabi_uidiv>
   8:   bc10        pop {r4}
   a:   bc02        pop {r1}
   c:   4708        bx  r1
   e:   46c0        nop         ; (mov r8, r8)

so if I trim it to

unsigned short fun ( unsigned short x )
{
    return(x/5);
}

we would expect to see (x*0xCCCD)>>18, right? Nope, even more code:

00000000 <fun>:
   0:   b510        push    {r4, lr}
   2:   2105        movs    r1, #5
   4:   f7ff fffe   bl  0 <__aeabi_uidiv>
   8:   0400        lsls    r0, r0, #16
   a:   0c00        lsrs    r0, r0, #16
   c:   bc10        pop {r4}
   e:   bc02        pop {r1}
  10:   4708        bx  r1
  12:   46c0        nop         ; (mov r8, r8)

If a 32*32 = 64 bit unsigned multiply is good enough to do the times-1/5th trick, and the compiler knows this, then why doesn't it know that a 16*16 = 32 bit multiply, which it has (or can mask its way into), works just as well? Being explicit about the operand sizes doesn't help either:

unsigned short fun ( unsigned short x )
{
    return((x&0xFFFF)/(5&0xFFFF));
}

So the next thing I would do is run an experiment to confirm I haven't messed up my understanding of the math: in this case, try every input value against both a machine divide and the multiply-by-1/5 trick and see that they match. If/when that passes, then hand optimize the code to avoid the library call. (I'm actually doing this in some code right now, thus the realization that there should be a matching optimization on the armv6-m.)

#include <stdio.h>
int main ( void )
{
    unsigned int ra,rb,rc;
    for(ra=0;ra<0x10000;ra++)
    {
        rb=ra/5;
        rc=(ra*0xCCCD)>>18;
        if(rb!=rc)
        {
            printf("0x%08X 0x%08X 0x%08X\n",ra,rb,rc);
        }
    }
    printf("done\n");
    return(0);
}

test passed.

old_timer answered Jan 17 '23