I'm profiling a toy program (selection sort) using perf, and I wonder what "iterations" corresponds to in the perf report output. The addresses it shows correspond to the inner loop and the if statement. I hope somebody can help. Also, the basic block cycles column disappears when I use "-b --branch-history" with perf. I don't know why.
This is the portion of my code getting sampled (MAX_LENGTH is 500):
35 // FROM: https://www.geeksforgeeks.org/selection-sort
37 void swap(int *xp, int *yp)
38 {
39 int temp = *xp;
40 *xp = *yp;
41 *yp = temp;
42 }
43
44 void selection_sort(int arr[])
45 {
46 int i, j, min_idx;
47
48 // One by one move boundary of unsorted subarray
49 for (i = 0; i < MAX_LENGTH-1; i++)
50 {
51 // Find the minimum element in unsorted array
52 min_idx = i;
53 for (j = i+1; j < MAX_LENGTH; j++)
54 if (arr[j] < arr[min_idx])
55 min_idx = j;
56
57 // Swap the found minimum element with the first element
58 swap(&arr[min_idx], &arr[i]);
59 }
60 }
It was compiled using (clang version 5.0.0):
clang -O0 -g selection_sort.c -o selection_sort_g_O0
Here's how I invoke perf record:
sudo perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=1009/pp -b -g ./selection_sort_g_O0
perf report and its output:
sudo perf report -b --branch-history --no-children
Samples: 376 of event 'br_inst_retired_near_taken', Event count (approx.): 37603384
Overhead Source:Line Symbol Shared Object ▒
+ 51.86% selection_sort_g_O0[862] [.] 0x0000000000000862 selection_sort_g_O0 ▒
- 24.47% selection_sort_g_O0[86e] [.] 0x000000000000086e selection_sort_g_O0 ▒
0x873 (cycles:1) ▒
- 0x86e (cycles:1) ▒
- 23.94% 0x86e (cycles:3 iterations:25) ▒
0x862 (cycles:3) ▒
0x83f (cycles:1) ▒
0x87c (cycles:1) ▒
0x873 (cycles:1) ▒
0x86e (cycles:1) ▒
0x86e (cycles:3) ▒
0x862 (cycles:3) ▒
0x83f (cycles:1) ▒
0x87c (cycles:1) ▒
0x873 (cycles:1) ▒
0x86e (cycles:1) ▒
0x86e (cycles:3) ▒
0x862 (cycles:3) ▒
+ 22.61% selection_sort_g_O0[87c] [.] 0x000000000000087c selection_sort_g_O0 ▒
+ 1.06% selection_sort_g_O0[8a5] [.] 0x00000000000008a5 selection_sort_g_O0
I used objdump for a mapping between addresses and source file lines:
objdump -Dleg selection_sort_g_O0 > selection_sort_g_O0.s
../selection_sort.c:53
836: 8b 45 f4 mov -0xc(%rbp),%eax
839: 83 c0 01 add $0x1,%eax
83c: 89 45 f0 mov %eax,-0x10(%rbp)
83f: 81 7d f0 f4 01 00 00 cmpl $0x1f4,-0x10(%rbp)
846: 0f 8d 35 00 00 00 jge 881 <selection_sort+0x71>
../selection_sort.c:54
84c: 48 8b 45 f8 mov -0x8(%rbp),%rax
850: 48 63 4d f0 movslq -0x10(%rbp),%rcx
854: 8b 14 88 mov (%rax,%rcx,4),%edx
857: 48 8b 45 f8 mov -0x8(%rbp),%rax
85b: 48 63 4d ec movslq -0x14(%rbp),%rcx
85f: 3b 14 88 cmp (%rax,%rcx,4),%edx
862: 0f 8d 06 00 00 00 jge 86e <selection_sort+0x5e>
../selection_sort.c:55
868: 8b 45 f0 mov -0x10(%rbp),%eax
86b: 89 45 ec mov %eax,-0x14(%rbp)
../selection_sort.c:54
86e: e9 00 00 00 00 jmpq 873 <selection_sort+0x63>
../selection_sort.c:53
873: 8b 45 f0 mov -0x10(%rbp),%eax
876: 83 c0 01 add $0x1,%eax
879: 89 45 f0 mov %eax,-0x10(%rbp)
87c: e9 be ff ff ff jmpq 83f <selection_sort+0x2f>
I will try to reiterate and add some more information on top of Zulan's answer.
Last Branch Records (LBRs) make it possible to find the hot execution paths in an executable and examine them directly for optimization opportunities. In perf this is implemented by extending the call-stack display mechanism: the last basic blocks are added to the call stack, which is normally used to display the most common hierarchy of function calls.
This is done by using the call graph (-g) and LBR (-b) options in perf record and the --branch-history option in perf report, which adds the last branch information to the call graph. Essentially, it gives 8 to 32 branches of extra context for why something happened.
The timed LBR feature in recent perf versions reports the average number of cycles per basic block.
What is iterations?
From what I can understand, the branch history code has a loop detection function. This allows us to obtain the number of iterations by calculating the number of removed loops.
The removal of repeated loops was introduced only in perf report output (to display it in a histogram format) via a previous commit in the Linux kernel.
struct iterations is the C struct in the perf source that is used to display the number of iterations in perf report.
The save_iterations function is where the number of iterations is saved so that it can be displayed in your perf report output. It is called from inside the remove_loops function.
The loops are removed at the time the callchain is resolved.
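To make the idea of loop removal more concrete, here is a minimal sketch in C. It is not the code from perf: struct branch and collapse_loops are hypothetical names made up for this illustration, and the naive comparison it uses is only an assumption about how such a detection could work. It drops every back-to-back repetition of a branch sequence and counts how many repetitions were dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical branch record; only the source address matters here. */
struct branch {
    uint64_t from;
};

/*
 * Collapse back-to-back repetitions in a branch history (illustrative only,
 * not perf's implementation). Whenever the `len` entries starting at index i
 * are immediately followed by an identical copy, the copy is dropped and
 * *removed is incremented. Returns the new number of entries.
 */
size_t collapse_loops(struct branch *b, size_t n, unsigned *removed)
{
    assert(b != NULL && removed != NULL);
    *removed = 0;

    for (size_t i = 0; i < n; i++) {
        size_t len = 1;
        while (i + 2 * len <= n) {
            if (memcmp(&b[i], &b[i + len], len * sizeof(*b)) == 0) {
                /* Drop the repeated copy by shifting the tail left. */
                memmove(&b[i + len], &b[i + 2 * len],
                        (n - i - 2 * len) * sizeof(*b));
                n -= len;
                (*removed)++;
                len = 1; /* look for further repetitions at this position */
            } else {
                len++;
            }
        }
    }
    return n;
}
```

For example, if a history consists of the five addresses 0x862, 0x83f, 0x87c, 0x873, 0x86e repeated three times, this sketch leaves one copy of the five entries and reports two removed repetitions. A count of this kind is what a report can then show next to the collapsed entry.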
You can also read this commit, which describes how perf report displays the number of iterations and the changes that have been introduced in newer Linux kernel versions.