 

(How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?

I used llvm-mca to compute the total cycles of a piece of code, thinking they would predict its runtime. However, measuring the runtime dynamically showed almost no correlation. So: Why do the total cycles computed by llvm-mca not accurately predict the runtime? Can I predict the runtime in some better way with llvm-mca?


Details:

I wanted to know the runtime of the following code for different types of begin (and end) iterators, for startValue being 0.0 or 0ULL:

std::accumulate(begin, end, startValue)

To predict the runtimes, I used the Compiler Explorer (https://godbolt.org/z/5HDzSF) with its LLVM Machine Code Analyzer (llvm-mca) plugin, since llvm-mca is "a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance". I used the following code:

#include <algorithm>  // std::generate
#include <numeric>    // std::accumulate
#include <random>
#include <vector>

using vec_t = std::vector<double>;

vec_t generateRandomVector(vec_t::size_type size)
{
    std::random_device rnd_device;
    std::mt19937 mersenne_engine {rnd_device()};
    std::uniform_real_distribution dist{0.0,1.1};
    auto gen = [&dist, &mersenne_engine](){
        return dist(mersenne_engine);
    };
    vec_t result(size);
    std::generate(result.begin(), result.end(), gen);
    return result;
}

double start()
{
    vec_t vec = generateRandomVector(30000000);
    vec_t::iterator vectorBegin = vec.begin();
    vec_t::iterator vectorEnd = vec.end();
    __asm volatile("# LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("# LLVM-MCA-END");    
    return result;
}

However, I see no correlation between the total cycles computed by llvm-mca and the wall-clock time from running the corresponding std::accumulate. For example, in the code above, the Total Cycles are 2806 and the runtime is 14ms. When I switch to the startValue 0ULL, the Total Cycles are 2357, but the runtime is 117ms.
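
A wall-clock measurement of this kind looks roughly as follows (a std::chrono sketch, not the exact benchmarking harness used for the numbers above):

#include <chrono>
#include <cstdio>
#include <numeric>

// Sketch only: time the accumulate call with a steady clock and report milliseconds.
double timedAccumulate(vec_t::iterator begin, vec_t::iterator end)
{
    auto t0 = std::chrono::steady_clock::now();
    double result = std::accumulate(begin, end, 0.0);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("std::accumulate took %lld ms\n", static_cast<long long>(ms));
    return result;
}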

asked Jan 09 '19 by DaveFar


1 Answer

TL;DR: LLVM-MCA analyzed the entire chunk of code between those comments as if it were the body of a loop, and showed you the cycle count for 100 iterations of all those instructions.

But as well as the actual (tiny) loop, most of the instructions are the loop setup and the SIMD horizontal sum after the loop, which actually run only once. (That's why the cycle count is in the thousands, not 400 = 100 times the 4-cycle latency of vaddpd on Skylake for the 0.0 version with a double accumulator.)

If you uncheck the "//" box on the Godbolt compiler explorer, or modify the asm statements to add a nop like "nop # LLVM-MCA-END", you'll be able to find those lines in the asm window and see what LLVM-MCA was looking at as its "loop".
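
For example, with a nop added so the marker lines show up in the asm window, the source lines from the question would become (same code, only the marker strings changed):

    __asm volatile("nop # LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("nop # LLVM-MCA-END");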


LLVM MCA simulates a specified sequence of assembly instructions and calculates the number of cycles that sequence is estimated to take per iteration on the specified target architecture. LLVM MCA makes a number of simplifications, such as (off the top of my head): (1) it assumes that all conditional branches fall through, (2) it assumes that all memory accesses are of the Write Back memory type and all hit in the L1 cache, (3) it assumes that the frontend works optimally, and (4) call instructions are not followed into the called procedure; they just fall through. There are other assumptions as well that I can't recall at the moment.

Essentially, LLVM MCA (like Intel IACA) is only accurate for backend-compute-bound, simple loops. In IACA, while most instructions are supported, a few instructions are not modeled in detail. As an example, prefetch instructions are assumed to consume only microarchitectural resources but take basically zero latency and have no impact on the state of the memory hierarchy. It appears to me that MCA completely ignores such instructions, however. Anyway, this is not particularly relevant to your question.

Now back to your code. In the Compiler Explorer link you provided, you didn't pass any options to LLVM MCA. So the default target architecture takes effect, which is whatever architecture the tool is running on. This happens to be SKX. The total number of cycles you mentioned is for SKX, but it's not clear whether you ran the code on SKX or not. You should use the -mcpu option to specify the architecture. This is independent of the -march you passed to gcc. Also note that comparing core cycles to milliseconds is not meaningful. You can use the RDTSC instruction to measure the execution time in terms of core cycles.
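
For example, a measurement in cycles rather than milliseconds could look roughly like this (a sketch assuming x86 and GCC/Clang's <x86intrin.h>; note that RDTSC counts reference cycles at the TSC frequency, which are not exactly core clock cycles when the frequency scales):

#include <x86intrin.h>   // __rdtsc (GCC/Clang on x86)
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> vec(30000000, 1.0);
    unsigned long long t0 = __rdtsc();
    double result = std::accumulate(vec.begin(), vec.end(), 0.0);
    unsigned long long t1 = __rdtsc();
    // Reference cycles rather than milliseconds, which is closer to the units llvm-mca reports.
    std::printf("sum = %f, ~%llu reference cycles\n", result, t1 - t0);
}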

Notice how the compiler has inlined the call to std::accumulate. Apparently, this code starts at assembly line 405 and the last instruction of std::accumulate is at line 444, a total of 38 instructions. The reason the LLVM MCA estimate does not match the actual performance should now be clear: the tool assumes that all of these instructions are executed in a loop for a large number of iterations. Obviously, this is not the case; there is only one loop, spanning lines 420-424:

.L75:
        vaddpd  ymm0, ymm0, YMMWORD PTR [rax]
        add     rax, 32
        cmp     rax, rcx
        jne     .L75

Only this code should be the input to MCA. At the source code level, there is really no way to tell MCA to only analyze this code. You'd have to manually inline std::accumulate and place the LLVM-MCA-BEGIN and LLVM-MCA-END marks somewhere inside it.
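
A rough sketch of what that could look like (a hypothetical accumulateMarked helper; note that the asm marker statements themselves can affect code generation, e.g. by inhibiting vectorization, so you still have to check the asm window to see what actually ends up between the markers):

// Sketch: hand-written accumulate with the markers placed around just the loop.
double accumulateMarked(vec_t::iterator begin, vec_t::iterator end)
{
    double sum = 0.0;
    __asm volatile("# LLVM-MCA-BEGIN accumulateLoop");
    for (vec_t::iterator it = begin; it != end; ++it)
        sum += *it;
    __asm volatile("# LLVM-MCA-END");
    return sum;
}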

When passing 0ULL instead of 0.0 to std::accumulate, the input to LLVM MCA would start at assembly instruction 402 and end at 441. Note that any instructions not supported by MCA (such as vcvtsi2sdq) will be completely omitted from the analysis. The part of the code that is actually in a loop is:

.L78:
        vxorpd  xmm0, xmm0, xmm0
        vcvtsi2sdq      xmm0, xmm0, rax
        test    rax, rax
        jns     .L75
        mov     rcx, rax
        and     eax, 1
        vxorpd  xmm0, xmm0, xmm0
        shr     rcx
        or      rcx, rax
        vcvtsi2sdq      xmm0, xmm0, rcx
        vaddsd  xmm0, xmm0, xmm0
.L75:
        vaddsd  xmm0, xmm0, QWORD PTR [rdx]
        vcomisd xmm0, xmm1
        vcvttsd2si      rax, xmm0
        jb      .L77
        vsubsd  xmm0, xmm0, xmm1
        vcvttsd2si      rax, xmm0
        xor     rax, rdi
.L77:
        add     rdx, 8
        cmp     rsi, rdx
        jne     .L78

Note that there is a conditional jump, jns, in the code whose target address is somewhere within the block. MCA will just assume that the jump falls through. If that is not the case in an actual run of the code, MCA will unnecessarily add the overhead of 7 instructions. There is another jump, jb, but I think it is not important for large vectors and will fall through most of the time. The last jump, jne, is also the last instruction, so MCA will assume that the next instruction is the top one again. For a sufficiently large number of iterations, this assumption is perfectly fine.

Overall, it's obvious that the first code is much smaller than the second one, and so it's probably much faster. Your measurements do confirm this. You also don't really need a microarchitecture analysis tool to understand why: the second code simply does a lot more computation. So you can quickly conclude that passing 0.0 is better in terms of performance and code size on all architectures.

answered Oct 20 '22 by Hadi Brais