I used llvm-mca to compute the total cycles of a piece of code, thinking they would predict its runtime. However, measuring the runtime dynamically showed almost no correlation. So: Why do the total cycles computed by llvm-mca not accurately predict the runtime? Can I predict the runtime in some better way with llvm-mca?
Details:
I wanted to know the runtime of the following code for different types of begin (and end) iterators, with startValue being 0.0 or 0ULL:
std::accumulate(begin, end, startValue)
To predict the runtimes, I used the Compiler Explorer (https://godbolt.org/z/5HDzSF) with its LLVM Machine Code Analyzer (llvm-mca) plugin, since llvm-mca is "a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance". I used the following code:
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

using vec_t = std::vector<double>;

vec_t generateRandomVector(vec_t::size_type size)
{
    std::random_device rnd_device;
    std::mt19937 mersenne_engine {rnd_device()};
    std::uniform_real_distribution dist{0.0, 1.1};
    auto gen = [&dist, &mersenne_engine]() {
        return dist(mersenne_engine);
    };
    vec_t result(size);
    std::generate(result.begin(), result.end(), gen);
    return result;
}

double start()
{
    vec_t vec = generateRandomVector(30000000);
    vec_t::iterator vectorBegin = vec.begin();
    vec_t::iterator vectorEnd = vec.end();
    __asm volatile("# LLVM-MCA-BEGIN stopwatchedAccumulate");
    double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
    __asm volatile("# LLVM-MCA-END");
    return result;
}
However, I see no correlation between the total cycles computed by llvm-mca and the wall clock time from running the corresponding std::accumulate. For example, in the code above, the Total Cycles are 2806 and the runtime is 14 ms. When I switch to the startValue 0ULL, the Total Cycles are 2357, but the runtime is 117 ms.
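A wall-clock measurement of just the accumulate call can be done along these lines (simplified sketch, not the exact harness used for the numbers above; assumes the headers from the listing plus <chrono> and <iostream>):

vec_t vec = generateRandomVector(30000000);

auto t0 = std::chrono::steady_clock::now();
double result = std::accumulate(vec.begin(), vec.end(), 0.0);  // only the accumulate is timed
auto t1 = std::chrono::steady_clock::now();

std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
          << " ms, result = " << result << '\n';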
TL;DR: LLVM-MCA analyzed the entire chunk of code between those comments as if it were the body of a loop, and showed you the cycle count for 100 iterations of all those instructions.
But besides the actual (tiny) loop, most of the instructions are loop setup and the SIMD horizontal sum after the loop, which only run once. (That's why the cycle count is in the thousands, not 400 = 100 times the 4-cycle latency of vaddpd on Skylake for the 0.0 version with a double accumulator.)
If you uncheck the "//" (comments) filter box on the Godbolt compiler explorer, or modify the asm statements to add a nop like "nop # LLVM-MCA-END", you'll be able to find those marker lines in the asm window and see what LLVM-MCA was treating as its "loop".
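For example, the markers in the source above could be changed so that they emit a real instruction that survives into the filtered asm output:

__asm volatile("nop # LLVM-MCA-BEGIN stopwatchedAccumulate");
double result = std::accumulate(vectorBegin, vectorEnd, 0.0);
__asm volatile("nop # LLVM-MCA-END");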
LLVM MCA simulates a specified sequence of assembly instructions and estimates the number of cycles it takes to execute per iteration on the specified target architecture. LLVM MCA makes a number of simplifications, such as (off the top of my head): (1) it assumes that all conditional branches fall through, (2) it assumes that all memory accesses are of the Write Back memory type and all hit in the L1 cache, (3) it assumes that the frontend works optimally, and (4) call instructions are not followed into the called procedure; they just fall through. There are other assumptions as well that I can't recall at the moment.
Essentially, LLVM MCA (like Intel IACA) is only accurate for backend-compute-bound, simple loops. In IACA, while most instructions are supported, a few instructions are not modeled in detail. As an example, prefetch instructions are assumed to consume only microarchitectural resources but take basically zero latency and have no impact on the state of the memory hierarchy. It appears to me that MCA completely ignores such instructions, however. Anyway, this is not particularly relevant to your question.
Now back to your code. In the Compiler Explorer link you have provided, you didn't pass any options to LLVM MCA. So the default target architecture takes effect, which is whatever architecture the tool is running on. This happens to be SKX. The total cycle counts you mentioned are for SKX, but it's not clear whether you ran the code on SKX or not. You should use the -mcpu option to specify the architecture. This is independent of the -march you passed to gcc. Also note that comparing core cycles to milliseconds is not meaningful. You can use the RDTSC instruction to measure the execution time in terms of core cycles.
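For example, when llvm-mca is run directly on an assembly file, the target can be pinned with something like llvm-mca -mcpu=skylake-avx512 foo.s (foo.s being a hypothetical file containing the code to analyze). On the measurement side, a rough RDTSC-based sketch using the GCC/Clang __rdtsc() intrinsic, ignoring refinements such as serializing the measurement, could look like this:

#include <x86intrin.h>   // __rdtsc() on GCC/Clang

// ... vec prepared as before ...
unsigned long long c0 = __rdtsc();
double result = std::accumulate(vec.begin(), vec.end(), 0.0);
unsigned long long c1 = __rdtsc();
// c1 - c0 counts TSC ticks; the TSC runs at a fixed reference frequency,
// so this only approximates core cycles unless frequency scaling is controlled.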
Notice how the compiler has inlined the call to std::accumulate. Apparently, this code starts at assembly line 405 and the last instruction of the std::accumulate is at line 444, a total of 38 instructions. The reason the LLVM MCA estimate does not match the actual performance should now be clear: the tool assumes that all of these instructions are executed in a loop for a large number of iterations. Obviously this is not the case. There is only one loop, at lines 420-424:
.L75:
        vaddpd ymm0, ymm0, YMMWORD PTR [rax]
        add rax, 32
        cmp rax, rcx
        jne .L75
Only this code should be the input to MCA. At the source code level, there is really no way to tell MCA to analyze only this code. You'd have to manually inline std::accumulate and place the LLVM-MCA-BEGIN and LLVM-MCA-END marks somewhere inside it.
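A minimal sketch of that approach, assuming a hand-rolled equivalent of the std::accumulate call (the function name and marker placement are illustrative, and the empty asm statements can themselves perturb code generation, so the resulting asm should still be checked):

// Hand-inlined std::accumulate(begin, end, 0.0) with the markers moved
// so that they bracket only the summation loop.
double accumulateMarked(vec_t::iterator begin, vec_t::iterator end)
{
    double result = 0.0;
    __asm volatile("# LLVM-MCA-BEGIN accumulateLoop");
    for (auto it = begin; it != end; ++it)
        result += *it;
    __asm volatile("# LLVM-MCA-END");
    return result;
}

Alternatively, just the four-instruction loop shown above can be extracted into a standalone assembly file and fed to llvm-mca directly, which avoids the marker placement problem entirely.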
When passing 0ULL instead of 0.0 to std::accumulate, the input to LLVM MCA would start at assembly instruction 402 and end at 441. Note that any instructions not supported by MCA (such as vcvtsi2sdq) will be completely omitted from the analysis. The part of the code that is actually in a loop is:
.L78:
        vxorpd xmm0, xmm0, xmm0
        vcvtsi2sdq xmm0, xmm0, rax
        test rax, rax
        jns .L75
        mov rcx, rax
        and eax, 1
        vxorpd xmm0, xmm0, xmm0
        shr rcx
        or rcx, rax
        vcvtsi2sdq xmm0, xmm0, rcx
        vaddsd xmm0, xmm0, xmm0
.L75:
        vaddsd xmm0, xmm0, QWORD PTR [rdx]
        vcomisd xmm0, xmm1
        vcvttsd2si rax, xmm0
        jb .L77
        vsubsd xmm0, xmm0, xmm1
        vcvttsd2si rax, xmm0
        xor rax, rdi
.L77:
        add rdx, 8
        cmp rsi, rdx
        jne .L78
Note that there is a conditional jump, jns, in the code whose target address is somewhere within the block. MCA will just assume that the jump falls through. If that is not the case in an actual run of the code, MCA unnecessarily adds the overhead of 7 instructions. There is another jump, jb, but I think this one is not important for large vectors and will fall through most of the time. The last jump, jne, is also the last instruction, so MCA will assume that the next instruction is the top one again. For a sufficiently large number of iterations, this assumption is perfectly fine.
Overall, it's obvious that the first code is much smaller than the second one, and so it's probably much faster. Your measurements do confirm this. You also don't really need a microarchitecture analysis tool to understand why: the second code just does a lot more computation. So you can quickly conclude that passing 0.0 is better in terms of performance and code size on all architectures.