I have come across a strange performance regression from Linux kernel 3.11 to 3.12 on x86_64 systems. Running Mark Stock's Radiance benchmark on Fedora 20, 3.12 is noticeably slower. Nothing else is changed - identical binary, identical glibc - I just boot a different kernel version, and the performance changes. The timed program, rpict, is 100% CPU bound user-level code.
Before I report this as a bug, I'd like to find the cause for this behavior. I don't know a lot about the Linux kernel, and the change log from 3.11 to 3.12 does not give me any clue.
I observed this on two systems, an Intel Haswell (i7-4771) and an AMD Richland (A8-6600K). On the Haswell system user time went from 895 sec with 3.11 to 962 with 3.12. On the Richland, from 1764 to 1844. These times are repeatable to within a few seconds.
I did some profiling with perf, and found that IPC went down in the same proportion as the slowdown. On the Haswell system, this seems to be caused by more missed branches, but why should the prediction rate go down? Radiance does use the random number generator - could "better" randomness cause the missed branches? But apart from OMAP4 support, the RNG does not have to seem changed in 3.12.
On the AMD system, perf just points to more idle backend cycles, but the cause is not clear.
Haswell system:
3.11.10 895s user, 3.74% branch-misses, 1.65 insns per cycle
3.12.6 962s user, 4.22% branch-misses, 1.52 insns per cycle
Richland system:
3.11.10 1764s user, 8.23% branch-misses, 0.75 insns per cycle
3.12.6 1844s user, 8.26% branch-misses, 0.72 insns per cycle
I also looked at a diff from the dmesg output of both kernels, but did not see anything that might have caused such a slowdown of a CPU-bound program.
I tried switching the cpufreq governor from the default ondemand to peformance but that did not have any effect.
The executable was compiled using gcc 4.7.3 but not using AVX instructions. libm still seems to use some AVX (e.g. __ieee754_pow_fma4
) but these functions are only 0.3% of total execution time.
Additional info:
Any ideas (apart from bisecting the kernel changes)?
Let’s start with an example to clearly understand how kernel regression works. In this example, a kernel regression model is developed to predict river flow from catchment area. As shown in the data below, there exists a non-linear relationship between catchment area (in square mile) and river flow (in cubic feet per sec).
A key part of the operating system kernel's success is its performance and scalability. However, discussions have appeared on the Linux* Kernel Mailing List regarding large performance regression between kernel versions.
Therefore to attain a bias-variance trade-off, bandwidth should be optimized in kernel regression. R. Tibshirani and L. Wasserman, Nonparametric Regression, Statistical Machine Learning (2015) — lecture note of Carnegie Mellon University
In this example, a kernel regression model is developed to predict river flow from catchment area. As shown in the data below, there exists a non-linear relationship between catchment area (in square mile) and river flow (in cubic feet per sec). The output, y is the river flow and input, x is the catchment area in this example.
Let's check your perf stat
outputs: http://www.chr-breitkopf.de/tmp/perf-stat.A8.txt
Kernel 3.11.10
1805057.522096 task-clock # 0.999 CPUs utilized
183,822 context-switches # 0.102 K/sec
109 cpu-migrations # 0.000 K/sec
40,451 page-faults # 0.022 K/sec
7,523,630,814,458 cycles # 4.168 GHz [83.31%]
628,027,409,355 stalled-cycles-frontend # 8.35% frontend cycles idle [83.34%]
2,688,621,128,444 stalled-cycles-backend # 35.74% backend cycles idle [33.35%]
5,607,337,995,118 instructions # 0.75 insns per cycle
# 0.48 stalled cycles per insn [50.01%]
825,679,208,404 branches # 457.425 M/sec [66.67%]
67,984,693,354 branch-misses # 8.23% of all branches [83.33%]
1806.804220050 seconds time elapsed
Kernel 3.12.6
1875709.455321 task-clock # 0.999 CPUs utilized
192,425 context-switches # 0.103 K/sec
133 cpu-migrations # 0.000 K/sec
40,356 page-faults # 0.022 K/sec
7,822,017,368,073 cycles # 4.170 GHz [83.31%]
634,535,174,769 stalled-cycles-frontend # 8.11% frontend cycles idle [83.34%]
2,949,638,742,734 stalled-cycles-backend # 37.71% backend cycles idle [33.35%]
5,607,926,276,713 instructions # 0.72 insns per cycle
# 0.53 stalled cycles per insn [50.01%]
825,760,510,232 branches # 440.239 M/sec [66.67%]
68,205,868,246 branch-misses # 8.26% of all branches [83.33%]
1877.263511002 seconds time elapsed
There are almost 300 Gcycles more for 3.12.6 in the "cycles" field; and only 6,5 Gcycles were stalls of frontend and 261 Gcycles were stalled in the backend. You have only 0,2 G of additional branch misses (each cost about 20 cycles - per optim.manual page 597; so 4Gcycles), so I think that your performance problems are related to memory subsystem problems (more realistict backend event, which can be influenced by kernel). Pagefaults diffs and migration count are low, and I think they will not slowdown test directly (but migrations may move program to the worse place).
You should go deeper into perf
counters to find the exact type of problem (it will be easier if you have shorter runs of test). The Intel's manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf will help you. Check page 587 (B.3.2) for overall events hierarchy (FE and BE stalls are here too), B.3.2.1-B.3.2.3 for info on backend stalls and how to start digging (checks for cache events, etc) and below.
How can kernel influence the memory subsystem? It can setup different virtual-to-physical mapping (hardly the your case), or it can move process farther from data. You have not-NUMA machine, but Haswell is not the exact UMA - there is a ring bus and some cores are closer to memory controller or to some parts of shared LLC (last level cache). You can test you program with taskset
utility, bounding it to some core - kernel will not move it to other core.
UPDATE: After checking your new perf stats from A8 we see that there are more DLTB-misses for 3.12.6. With changes in /proc/pid/maps (lot of short [heap]
sections instead of single [heap]
, still no exact info why), I think that there can be differences in transparent hugepage (THP; with 2M hugepages there are less TLB entries needed for the same amount of memory and less tlb misses), for example in 3.12 it can't be applied due to short heap sections.
You can check your /proc/PID/smaps
for AnonHugePages
and /proc/vmstat
for thp* values to see thp results. Values are documented here kernel.org/doc/Documentation/vm/transhuge.txt
@osgx You found the cause! After echo never > /sys/kernel/mm/transparent_hugepage/enabled, 3.11.10 takes as long as 3.12.6!
Good news!
Additional info on how to disable the randomization, and on where to report this as a bug (a 7% performance regression is quite severe) would be appreciated
I was wrong, this multi-heap section effect is not the brk randomisation (which changes only beginning of the heap). This is failure of VMA merging in do_brk
; don't know why, but some changes for VM_SOFTDIRTY were seen in mm
between 3.11.10 - 3.12.6.
UPDATE2: Possible cause of not-merging VMA:
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2580 do_brk in 3.11
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L2577 do_brk in 3.12
3.12 just added at the end of do_brk
2663 vma->vm_flags |= VM_SOFTDIRTY;
2664 return addr;
And bit above we have
2635 /* Can we just expand an old private anonymous mapping? */
2636 vma = vma_merge(mm, prev, addr, addr + len, flags,
2637 NULL, NULL, pgoff, NULL);
and inside vma_merge
there is test for vm_flags
http://lxr.missinglinkelectronics.com/linux+v3.11/mm/mmap.c#L994 3.11
http://lxr.missinglinkelectronics.com/linux+v3.12/mm/mmap.c#L994 3.12
1004 /*
1005 * We later require that vma->vm_flags == vm_flags,
1006 * so this tests vma->vm_flags & VM_SPECIAL, too.
1007 */
vma_merge --> can_vma_merge_before --> is_mergeable_vma ...
898 if (vma->vm_flags ^ vm_flags)
899 return 0;
But at time of check, new vma is not marked as VM_SOFTDIRTY, while old is already marked.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With