How to profile time spent in memory access in C/C++ applications?

The total time spent by a function in an application can be broadly divided into two components:

  1. Time spent on actual computation (Tcomp)
  2. Time spent on memory accesses (Tmem)

Typically, profilers provide an estimate of the total time spent by a function. Is it possible to get an estimate broken down into the two components above (Tcomp and Tmem)?

asked Nov 15 '16 by Imran

3 Answers

The Roofline model introduces the notion of Arithmetic Intensity: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/. Simply put, it is the number of arithmetic instructions executed per memory access.

Arithmetic Intensity is usually computed with hardware performance counters.
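
As a rough sketch (the exact event names vary by CPU and should be checked with perf list; LLC-loads/LLC-stores are perf's generic last-level-cache events, and ./my_app is a placeholder for your binary), a crude arithmetic-intensity estimate can be derived from counter ratios:

$ perf list                                      # see which events your CPU exposes
$ perf stat -e instructions,LLC-loads,LLC-stores ./my_app
# arithmetic intensity ~ instructions / (LLC-loads + LLC-stores)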

answered Nov 15 '22 by Manuel Selva


Brendan Gregg, in his recent blog post CPU Utilization is Wrong, suggests using the instructions-per-cycle (IPC) PMC. In short, if the IPC is < 1.0, the app can be considered memory bound; otherwise it can be considered instruction bound. Here is a relevant excerpt from his post:

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.

For my above rules, I split on an IPC of 1.0. Where did I get that from? I made it up, based on my prior work with PMCs. Here's how you can get a value that's custom for your system and runtime: write two dummy workloads, one that is CPU bound, and one memory bound. Measure their IPC, then calculate their mid point.

Here are some examples of dummy workloads generated with the stress tool and their IPCs.
Memory-bound test; the IPC is low (0,02):

$ perf stat stress --vm 4 -t 3
stress: info: [4520] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [4520] successful run completed in 3s

 Performance counter stats for 'stress --vm 4 -t 3':

      10767,074968      task-clock:u (msec)       #    3,560 CPUs utilized          
                 0      context-switches:u        #    0,000 K/sec                  
                 0      cpu-migrations:u          #    0,000 K/sec                  
         4 555 919      page-faults:u             #    0,423 M/sec                  
     4 290 929 426      cycles:u                  #    0,399 GHz                    
        67 779 143      instructions:u            #    0,02  insn per cycle         
        18 074 114      branches:u                #    1,679 M/sec                  
             5 398      branch-misses:u           #    0,03% of all branches        

       3,024851934 seconds time elapsed

CPU-bound test; the IPC is high (1,44):

$ perf stat stress --cpu 4 -t 3
stress: info: [4465] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4465] successful run completed in 3s

 Performance counter stats for 'stress --cpu 4 -t 3':

      11419,683671      task-clock:u (msec)       #    3,805 CPUs utilized          
                 0      context-switches:u        #    0,000 K/sec                  
                 0      cpu-migrations:u          #    0,000 K/sec                  
               108      page-faults:u             #    0,009 K/sec                  
    30 562 187 954      cycles:u                  #    2,676 GHz                    
    43 995 290 836      instructions:u            #    1,44  insn per cycle         
    13 043 425 872      branches:u                # 1142,188 M/sec                  
        26 312 747      branch-misses:u           #    0,20% of all branches        

       3,001218526 seconds time elapsed
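
To apply the same calibration to your own program (./my_app is a placeholder), a minimal perf invocation that reports IPC directly is:

$ perf stat -e cycles,instructions ./my_app
# perf prints "insn per cycle" next to the instructions counter;
# compare it against the midpoint of the two dummy workloads above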
answered Nov 15 '22 by ks1322


It is not possible to measure this directly (and it would not make much sense to do so), because computation overlaps with memory accesses on current processor architectures. Memory access itself is also broken down into several steps (issuing the access, prefetching into the various cache levels, and finally reading into the processor registers).

You can, however, measure cache hits and misses at the various cache levels to estimate how efficiently your algorithm uses your hardware, using perf and its hardware counters (if your hardware supports them).
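
For example, a sketch using perf's generic hardware cache events (they may be unsupported or mapped differently on some CPUs; ./my_app is a placeholder):

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./my_app
# high miss ratios at L1 or LLC point to poor memory locality,
# even though they do not translate into an exact Tmem figure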

answered Nov 15 '22 by Jakuje