Optimize this assembly code

Tags:

I'm taking a Computer Architecture course right now, and we're going over basic R-type and I-type instructions (also, this is a RISC architecture), etc. I can't seem to figure out how to optimize this code.

Explanation: This code adds words in an array (pointed to by $s1) of numbers until a zero is reached. The result is stored in $t1. $t0 holds the current word.

Click to copy

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        beq   $t1, $t1, again        # Loop again
done:
        nop                          # Do nothing

I'm having a difficult time optimizing the code. I feel the beq $t1, $t1, again (since it's always true) is unnecessary, but I'm not sure how to remove it. Here's my attempt, but I now realize that my code would not terminate.

Click to copy

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        bne   $t1, $zero, again      # If result is not zero, loop
done:
        nop                          # Do nothing

I'm never checking for a terminating zero and jumping to done. But if I add another check, then wouldn't the code be the same as before?

678

asked Jan 26 '12 19:01

ardavis

1 Answers

Typically you want to convert a test at the top loop into a test at the bottom loop. To do that, you frequently have to jump into (more or less) the middle of the loop body for the first iteration. In pseudo code, what you have right now is basically:

Click to copy

    initialize sum
beginning:
    load a word
    if (done) goto end
    add to sum
    increment pointer
    goto beginning
end:

to optimize that, we want to change the structure to something like this:

Click to copy

    initialize sum
    goto start_loop

beginning:
    add to sum
    increment pointer
start_loop:
    load a word
    if (!done) goto beginning

This way there's only one jump per loop instead of two (and it's a short backward jump, so the target will nearly always be in the cache).

That said, I should add that this optimization is really mostly obsolete -- with decent branch prediction, an unconditional jump is usually essentially free.

Edit: since loop unrolling has been mentioned, I'll add my two cents about it. Branch prediction generally renders loop unrolling obsolete, unless you can use it to execute more instructions in parallel. That's not an issue here, but often useful in real life. For example, if we assume a CPU with two separate adders available, we can unroll two iterations of the loop, and add the results from those iterations into two separate target registers, so we take advantage of both adders by adding two values at the same time. Then, when the loop terminates, we add together those two registers to get the final value. In C-like psuedo-code, that would come out something like this:

Click to copy

odds = 0
evens = 0

do {   
    evens += pointer[0];
    odds += pointer[1];
    pointer += 2;
while (pointer[0] && pointer[1]);
total = odds + evens;

As written, this adds the minor extra requirements for two consecutive zeros to terminate the sequence instead of just one, and a minimum of two items in the array to be added. Note, however, that it's not really the loop unrolling that gives the major advantage here, but the parallel execution.

Absent that, we really only see a benefit from loop unrolling if a branch that's not taken is cheaper than a branch that is taken (even if both are predicted correctly). On older processors (e.g., older Intels) taking a branch did carry a penalty relative to a non-taken branch. At the same time, the unrolled loop will use more cache space. Thus, on a modern processor, it's frequently an overall loss (unless, as I said before, we can use the unrolling to gain parallelism).

155

answered Oct 13 '22 11:10

Jerry Coffin

Related questions
                            
                                Fastest x86 assembly code to synchronize access to an array? [closed]
                            
                                Crenshaw's "Let's Build a Compiler": Transcription to C and x86 Assembler? [closed]
                            
                                Modify return address on stack
                            
                                Learning SSE/SSE2 and asm optimizations
                            
                                Automatic code deduplication of assembly language?
                            
                                How to execute the CMU binary bomb in Ubuntu Linux?
                            
                                Assembly MIPS - How do I store an integer from the user into memory?
                            
                                Why move 32-bit register to stack then from stack to xmm register?
                            
                                ARM GCC generated functions prolog
                            
                                Why gcc emits unneeded memory accesses when passing trivial struct by value?
                            
                                Unneccessary pop instructions in functions with early if statement
                            
                                _InterlockedCompareExchange optimization
                            
                                Visual Studio - Compiling 32-bit code inside 64-bit project
                            
                                How can I create an executable from LLVM ir?
                            
                                Unable to include ASM header file in C without losing preprocessor
                            
                                What is the meaning of PCDATA in Go assembly
                            
                                Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
                            
                                COM calling convention on x64
                            
                                GCC/objdump: Generating compilable/buildable assembly (interspersed with C/C++) source?
                            
                                how to write inline assembly codes about LOOP in Xcode LLVM?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Optimize this assembly code

Tags:

assembly

computer-architecture

ardavis

People also ask

1 Answers

Jerry Coffin

Recent Activity

Donate For Us