I'm trying to apply some performance engineering techniques to an implementation of Dijkstra's algorithm. In an attempt to find bottlenecks in the (naive and unoptimised) program, I'm using the perf command to record the number of cache misses. The relevant snippet of code is the following, which finds the unvisited node with the smallest distance:
for (int i = 0; i < count; i++) {
    if (!visited[i]) {
        if (tmp == -1 || dist[i] < dist[tmp]) {
            tmp = i;
        }
    }
}
For the LLC-load-misses metric, perf report shows the following annotation of the assembly:
│ for (int i = 0; i < count; i++) {
1.19 │ ff: add $0x1,%eax
0.03 │102: cmp 0x20(%rsp),%eax
│ ↓ jge 135
│ if (!visited[i]) {
0.07 │ movslq %eax,%rdx
│ mov 0x18(%rsp),%rdi
0.70 │ cmpb $0x0,(%rdi,%rdx,1)
0.53 │ ↑ jne ff
│ if (tmp == -1 || dist[i] < dist[tmp]) {
0.07 │ cmp $0xffffffff,%r13d
│ ↑ je fc
0.96 │ mov 0x40(%rsp),%rcx
0.08 │ movslq %r13d,%rsi
│ movsd (%rcx,%rsi,8),%xmm0
0.13 │ ucomis (%rcx,%rdx,8),%xmm0
57.99 │ ↑ jbe ff
│ tmp = i;
│ mov %eax,%r13d
│ ↑ jmp ff
│ }
│ }
│ }
My question then is the following: why does the jbe instruction produce so many cache misses? This instruction should not have to retrieve anything from memory at all, if I am not mistaken. I figured it might have something to do with instruction cache misses, but even measuring only L1 data cache misses using L1-dcache-load-misses shows that there are a lot of cache misses in that instruction.
This stumps me somewhat. Could anyone explain this (in my eyes) odd result? Thank you in advance.
About your example:
There are several instructions just before and at the high counter:
│ movsd (%rcx,%rsi,8),%xmm0
0.13 │ ucomis (%rcx,%rdx,8),%xmm0
57.99 │ ↑ jbe ff
"movsd" loads a double from (%rcx,%rsi,8) (an array access) into the xmm0 register, and "ucomis" (the mnemonic is displayed truncated; with double operands this is ucomisd) loads another double from (%rcx,%rdx,8) and compares it with the value just loaded into xmm0. "jbe" is a conditional jump that depends on the outcome of the compare.
Many modern Intel CPUs (and probably AMD ones too) can and will fuse some combinations of operations together (realworldtech.com/nehalem/5: "into a single uop, CMP+JCC"), and a compare followed by a conditional jump is a very common combination to be fused (you can check this with the Intel IACA simulation tool; use version 2.1 for your CPU). perf/PMUs/PEBS may report a fused pair incorrectly, skewing most events towards one of the two instructions.
This code probably means that the expression "dist[i] < dist[tmp]" generates two memory accesses, and both values are used in the ucomis instruction, which is (partially?) fused with the jbe conditional jump. Either dist[i] or dist[tmp], or both, generate a high number of misses. Any such miss blocks ucomis from producing its result and blocks jbe from supplying the next instruction to execute (or from retiring the predicted instructions). So jbe may get all the fame of the high counters instead of the real memory-access instructions (and for a "far" event like a cache response there is some skew towards the last blocked instruction).
You may try to merge the visited[N] and dist[N] arrays into a single array[N] of struct { int visited; double dist; } (dist is a double here, since the assembly reads it with movsd and an 8-byte scale) to force prefetching of array[i].dist when you access array[i].visited; or you may try to change the order of vertex access, or renumber the graph vertices, or do some software prefetching for the next one or more elements (?)
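The merged layout suggested above could look roughly like the following sketch. The struct name, the function name and the prefetch distance of 8 are hypothetical choices for illustration, not taken from the original program, and __builtin_prefetch is a GCC/Clang builtin. Whether the prefetch hint helps at all on a plain sequential scan is something to measure rather than assume.

```c
#include <stddef.h>

/* Hypothetical merged record: one element holds both fields, so reading
 * nodes[i].visited pulls nodes[i].dist into the same cache line. */
struct node {
    double dist;            /* double, matching the movsd / 8-byte scale */
    unsigned char visited;  /* one byte, matching the cmpb in the listing */
};

/* Return the index of the unvisited node with the smallest distance,
 * or -1 if every node has been visited (or count is 0). */
int find_min_unvisited(const struct node *nodes, int count)
{
    int best = -1;

    for (int i = 0; i < count; i++) {
#if defined(__GNUC__)
        /* Optional hint: ask for a later element ahead of time.
         * The distance of 8 elements is a guess, not a tuned value. */
        if (i + 8 < count)
            __builtin_prefetch(&nodes[i + 8], 0, 1);
#endif
        if (!nodes[i].visited &&
            (best == -1 || nodes[i].dist < nodes[best].dist)) {
            best = i;
        }
    }
    return best;
}
```

Note that the struct is padded to 16 bytes on typical 64-bit ABIs, so this trades a smaller footprint for locality between the two fields; profiling both layouts is the only way to know which one wins for a given graph.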
About the problems with generic perf events selected by name, and possible uncore skew:
The perf (perf_events) tool in Linux uses a predefined set of events when called as perf list, and some of the listed hardware events may not be implemented; others are mapped to the capabilities of the current CPU (and some mappings are not fully correct). Some basic information about the real PMU is in https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (but it has more details for the related Nehalem-EP variant).
For your Nehalem (Intel Core i5 750 with 8 MB of L3 cache and without multi-CPU/multi-socket/NUMA support), perf will map the standard ("Generic cache events") LLC-load-misses event to "OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS", as written in the best (and only) documentation of perf event mappings, the kernel source code:
http://elixir.free-electrons.com/linux/v4.8/source/arch/x86/events/intel/core.c#L1103
u64 nehalem_hw_cache_event_ids ...
[ C(LL ) ] = {
[ C(OP_READ) ] = {
/* OFFCORE_RESPONSE.ANY_DATA.LOCAL_CACHE */
[ C(RESULT_ACCESS) ] = 0x01b7,
/* OFFCORE_RESPONSE.ANY_DATA.ANY_LLC_MISS */
[ C(RESULT_MISS) ] = 0x01b7,
...
/*
* Nehalem/Westmere MSR_OFFCORE_RESPONSE bits;
* See IA32 SDM Vol 3B 30.6.1.3
*/
#define NHM_DMND_DATA_RD (1 << 0)
#define NHM_DMND_READ (NHM_DMND_DATA_RD)
#define NHM_L3_MISS (NHM_NON_DRAM|NHM_LOCAL_DRAM|NHM_REMOTE_DRAM|NHM_REMOTE_CACHE_FWD)
...
u64 nehalem_hw_cache_extra_regs
..
[ C(LL ) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = NHM_DMND_READ|NHM_L3_ACCESS,
[ C(RESULT_MISS) ] = NHM_DMND_READ|NHM_L3_MISS,
I think this event is not precise: the CPU pipeline will post the load request to the cache hierarchy (out of order) and go on executing other instructions. After some time (around 10 cycles to reach and get a response from L2, and 40 cycles to reach L3) a response with the miss flag arrives at the corresponding (offcore?) PMU and increments the counter. When this counter overflows, the PMU generates a profiling interrupt. It takes several CPU clock cycles to reach the pipeline and interrupt it; the handler of the perf_events subsystem then records the current (interrupted) EIP/RIP instruction pointer and resets the PMU counter back to some negative value (for example, -100000 to get an interrupt for every 100000 L3 misses counted; use perf record -e LLC-load-misses -c 100000 to set an exact count, or perf will autotune the limit to reach some default frequency). The recorded EIP/RIP is not the IP of the load instruction, and it may also not be the EIP/RIP of the instruction that wants to use the loaded data.
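As a rough illustration of the sampling setup described above, the sketch below fills in a perf_event_attr for the generic last-level-cache read-miss event with a fixed sample period, which is approximately what perf record -e LLC-load-misses -c 100000 requests from the kernel. The two helper names are hypothetical; the constants and struct fields are the ones documented in perf_event_open(2). The code only prepares the attribute: actually opening the event needs the perf_event_open syscall and suitable permissions, and it does nothing to remove the skid between the miss and the recorded instruction pointer.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Generic cache event encoding from perf_event_open(2):
 * cache id | (operation << 8) | (result << 16). */
uint64_t llc_load_miss_config(void)
{
    return (uint64_t)PERF_COUNT_HW_CACHE_LL
         | ((uint64_t)PERF_COUNT_HW_CACHE_OP_READ << 8)
         | ((uint64_t)PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
}

/* Prepare an attribute that samples the instruction pointer once every
 * `period` LLC read misses (hypothetical helper, for illustration). */
void init_llc_miss_sampling(struct perf_event_attr *attr, uint64_t period)
{
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_HW_CACHE;
    attr->config = llc_load_miss_config();
    attr->sample_period = period;       /* counter overflows every `period` events */
    attr->sample_type = PERF_SAMPLE_IP; /* record the interrupted IP only */
    attr->disabled = 1;                 /* start disabled; enable explicitly later */
}
```

On the kernel side, this generic encoding is what gets translated through tables like nehalem_hw_cache_event_ids shown earlier.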
But if your CPU is the only socket in the system and you access normal memory (not some mapped PCI Express space), an L3 miss will in fact be served as a local memory access, and there are some counters for this... (https://software.intel.com/en-us/node/596851 - "Any memory requests missing here must be serviced by local or remote DRAM").
There are some listings of PMU events for your CPU:
The official "Intel® 64 and IA-32 Architectures Software Developer Manuals" (SDM) from Intel: https://software.intel.com/en-us/articles/intel-sdm, Volume 3, Appendix A
From oprofile: http://oprofile.sourceforge.net/docs/intel-corei7-events.php
showevtinfo from libpfm4: http://www.bnikolic.co.uk/blog/hpc-prof-events.html (note that this page has the Sandy Bridge list; get libpfm4 and run it on your PC to get your own list). There is also the check_events tool in libpfm4 to help you encode an event as raw for perf.
The ocperf tool from Intel's perf developer Andi Kleen, part of his pmu-tools: https://github.com/andikleen/pmu-tools. ocperf is just a wrapper for perf; the package downloads the event descriptions, and any supported event name is converted into the correct raw encoding for perf.
There should be some information about the implementation of the ANY_LLC_MISS offcore PMU event and a list of PEBS events for Nehalem, but I can't find it now.
I can recommend using ocperf from https://github.com/andikleen/pmu-tools with any PMU events of your CPU, without the need to encode them manually. There are some PEBS events in your CPU, and there is latency profiling / perf mem for some kind of memory access profiling (some random perf mem PDFs: 2012 post "perf: add memory access sampling support", RH 2013 - pg26-30, still not documented in 2015 - sowa pg19, ls /sys/devices/cpu/events). For newer CPUs there are newer tools like ucevent.
I can also recommend trying cachegrind, the profiler/cache simulator tool of valgrind, with the kcachegrind GUI to view the profiles. Valgrind-based profilers may help you get a basic idea of how the code works: they collect exact execution counts for every instruction, and cachegrind also simulates an abstract multi-level cache. But a real CPU executes several instructions per cycle (so the callgrind/cachegrind cost model of 1 instruction = 1 CPU clock cycle introduces some error, and the cachegrind cache model does not have the same logic as a real cache). And all valgrind tools are dynamic binary instrumentation tools, which will slow down your program 20-30 times compared to a native run.