Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Will the compiler unroll this loop?

I am creating a multi-dimensional vector (mathematical vector) where I allow basic mathematical operations +,-,/,*,=. The template takes in two parameters, one is the type (int, float etc.) while the other is the size of the vector. Currently I am applying the operations via a for loop. Now considering the size is known at compile time, will the compiler unroll the loop? If not, is there a way to unroll it with no (or minimal) performance penalty?

template <typename T, u32 size>
class Vector
{
public:
    // Various functions for mathematical operations. 
    // The functions take in a Vector<T, size>.
    // Example:
    void add(const Vector<T, size>& vec)
    {
        for (u32 i = 0; i < size; ++i)
        {
            values[i] += vec[i];
        }
    }

private:
    T   values[size];
};

Before somebody comments Profile then optimize please note that this is the basis for my 3D graphics engine and it must be fast. Second, I want to know for the sake of educating myself.

like image 811
Samaursa Avatar asked May 26 '11 16:05

Samaursa


People also ask

Do compilers unroll loops?

The compiler unrolls loops automatically at -O3 . Otherwise, any unrolling must be done in source code.

What is meant by loop unrolling?

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff.

Does Java do loop unrolling?

As a virtual machine, Java HotSpot VM has advanced loop unrolling capabilities to reduce or remove the overhead of back branches.

What is no loop unrolling?

Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program. We basically remove or reduce iterations. Loop unrolling increases the program's speed by eliminating loop control instruction and loop test instructions.


2 Answers

You can do the following trick with disassembly to see how the particular code is compiled.

    Vector<int, 16> a, b;
    Vector<int, 65536> c, d;

    asm("xxx"); // marker
    a.Add(b);
    asm("yyy"); // marker
    c.Add(d);
    asm("zzz"); // marker

Now compile

gcc -O3 1.cc -S -o 1.s

And see the disasm

    xxx
# 0 "" 2
#NO_APP
    movdqa  524248(%rsp), %xmm0
    leaq    524248(%rsp), %rsi
    paddd   524184(%rsp), %xmm0
    movdqa  %xmm0, 524248(%rsp)
    movdqa  524264(%rsp), %xmm0
    paddd   524200(%rsp), %xmm0
    movdqa  %xmm0, 524264(%rsp)
    movdqa  524280(%rsp), %xmm0
    paddd   524216(%rsp), %xmm0
    movdqa  %xmm0, 524280(%rsp)
    movdqa  524296(%rsp), %xmm0
    paddd   524232(%rsp), %xmm0
    movdqa  %xmm0, 524296(%rsp)
#APP
# 36 "1.cc" 1
    yyy
# 0 "" 2
#NO_APP
    leaq    262040(%rsp), %rdx
    leaq    -104(%rsp), %rcx
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L2:
    movdqa  (%rcx,%rax), %xmm0
    paddd   (%rdx,%rax), %xmm0
    movdqa  %xmm0, (%rdx,%rax)
    addq    $16, %rax
    cmpq    $262144, %rax
    jne .L2
#APP
# 38 "1.cc" 1
    zzz

As you see, the first loop was small enough to get unrolled. The second is the loop.

like image 92
klimkin Avatar answered Oct 04 '22 10:10

klimkin


First: Modern CPUs are pretty smart about predicting branches, so unrolling the loop might not help (and could even hurt).

Second: Yes, modern compilers know how to unroll a loop like this, if it is a good idea for your target CPU.

Third: Modern compilers can even auto-vectorize the loop, which is even better than unrolling.

Bottom line: Do not think you are smarter than your compiler unless you know a lot about CPU architecture. Write your code in a simple, straightforward way, and do not worry about micro-optimizations until your profiler tells you to.

like image 33
Nemo Avatar answered Oct 04 '22 11:10

Nemo