The total time spent by a function in an application can be broadly divided into two components: the time spent doing actual computation (Tcomp) and the time spent waiting for data from memory (Tmem).
Typically profilers provide an estimate of the total time spent by a function. Is it possible to get an estimate of the time spent in terms of the above two components (Tcomp and Tmem)?
A notion of Arithmetic Intensity has been proposed by the Roofline model: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Simply put, it is the number of arithmetic operations performed per byte of memory traffic.
Arithmetic Intensity is usually computed with hardware performance counters.
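For example, with Linux perf you can approximate the two ingredients of the ratio by counting floating-point operations and memory accesses over the same run. The event names below are only an illustration taken from Intel Skylake-class CPUs; run perf list on your machine to see what it actually exposes, and substitute your own binary for ./your_app:
$ perf stat -e fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.256b_packed_double,mem_inst_retired.all_loads,mem_inst_retired.all_stores ./your_app
The arithmetic intensity is then roughly the FLOP count divided by the number of bytes moved, where the byte count can be approximated from the load/store counts multiplied by the typical access width.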
Brendan Gregg, in his recent blog post CPU Utilization is Wrong, suggests using the instructions-per-cycle (IPC) PMC. In short, if IPC is < 1.0 the application can be considered memory bound; otherwise it can be considered instruction bound. Here is a relevant excerpt from his post:
If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.
Here are some examples of dummy workloads generated with the stress tool, and their IPCs.
Memory bound test, IPC is low (0,02):
$ perf stat stress --vm 4 -t 3
stress: info: [4520] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [4520] successful run completed in 3s
Performance counter stats for 'stress --vm 4 -t 3':
10767,074968 task-clock:u (msec) # 3,560 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
4 555 919 page-faults:u # 0,423 M/sec
4 290 929 426 cycles:u # 0,399 GHz
67 779 143 instructions:u # 0,02 insn per cycle
18 074 114 branches:u # 1,679 M/sec
5 398 branch-misses:u # 0,03% of all branches
3,024851934 seconds time elapsed
CPU bound test, IPC is high (1,44):
$ perf stat stress --cpu 4 -t 3
stress: info: [4465] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4465] successful run completed in 3s
Performance counter stats for 'stress --cpu 4 -t 3':
11419,683671 task-clock:u (msec) # 3,805 CPUs utilized
0 context-switches:u # 0,000 K/sec
0 cpu-migrations:u # 0,000 K/sec
108 page-faults:u # 0,009 K/sec
30 562 187 954 cycles:u # 2,676 GHz
43 995 290 836 instructions:u # 1,44 insn per cycle
13 043 425 872 branches:u # 1142,188 M/sec
26 312 747 branch-misses:u # 0,20% of all branches
3,001218526 seconds time elapsed
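Following Gregg's suggestion, the custom threshold for this particular machine would be roughly the midpoint of the two measured IPC values:
(0.02 + 1.44) / 2 ≈ 0.73
so an IPC well below ~0.73 points to a memory-bound workload on this system, and an IPC well above it points to an instruction-bound one.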
It is not possible to measure this directly (and it would not be very meaningful to do so), because computation is overlapped with memory accesses on current processor architectures. Also, a memory access is usually broken down into several steps (accessing main memory, prefetching into the various cache levels, and the actual read into processor registers).
You can, however, measure cache hits and misses at the various cache levels to estimate the efficiency of your algorithm on your hardware, using perf and its hardware counters (if your hardware supports them).
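A minimal sketch, assuming ./your_app is the program you want to analyse (these are perf's generic cache events; perf maps them to hardware counters, and not all of them are available on every CPU):
$ perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_app
A high miss ratio at a given level suggests that the working set does not fit in that cache and that the code is paying for memory latency there.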