TL;DR: Using a real-time Linux kernel with NO_HZ_FULL, I need to isolate a process in order to get deterministic results, but /proc/interrupts tells me there are still local timer interrupts (among others). How can I disable them?
Long version:
I want to make sure my program is not being interrupted, so I am trying to use a real-time Linux kernel. I'm using the real-time version of Arch Linux (linux-rt on AUR) and I modified the kernel configuration to select the following options:
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
Then I rebooted my computer into this real-time kernel with the following options:
nmi_watchdog=0
rcu_nocbs=1
nohz_full=1
isolcpus=1
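After rebooting, a quick way to confirm that these parameters took effect is to check the kernel command line and the sysfs CPU lists (a sketch; the sysfs paths below exist on recent kernels, adjust if yours differs):

```shell
# Check that the kernel actually booted with the expected parameters
cat /proc/cmdline
# On recent kernels these sysfs files report the effective CPU lists
cat /sys/devices/system/cpu/nohz_full   # CPUs running full dynticks
cat /sys/devices/system/cpu/isolated    # CPUs isolated via isolcpus=
```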
I also disabled the following options in the BIOS:
C-states
Intel SpeedStep
Turbo mode
VT-x
VT-d
Hyper-Threading
My CPU (i7-6700, 3.40 GHz) has 4 cores (8 logical CPUs with Hyper-Threading), and I can see CPU0, CPU1, CPU2 and CPU3 in the /proc/interrupts file.
CPU1 is isolated by the isolcpus kernel parameter, and I want to disable the local timer interrupts on this CPU.
I thought a real-time kernel with CONFIG_NO_HZ_FULL and CPU isolation (isolcpus) was enough to do it, and I tried to check by running these commands:
cat /proc/interrupts | grep LOC > ~/tmp/log/overload_cpu1
taskset -c 1 ./overload
cat /proc/interrupts | grep LOC >> ~/tmp/log/overload_cpu1
where the overload process is:
***overload.c:***
int main(void)
{
    /* busy loop; volatile keeps the compiler from optimizing it away */
    for (volatile int i = 0; i < 100; ++i)
        for (volatile int j = 0; j < 100000000; ++j)
            ;
    return 0;
}
The file overload_cpu1 contains the result:
LOC: 234328 488 12091 11299 Local timer interrupts
LOC: 239072 651 12215 11323 Local timer interrupts
meaning 651 - 488 = 163 interrupts from the local timer, not 0...
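As a sanity check, the subtraction can be scripted for all four CPUs at once; a sketch using the two LOC lines above (fields 2-5 of each line are the per-CPU counters):

```shell
# Compute per-CPU deltas between the two LOC snapshots quoted above
before="LOC: 234328 488 12091 11299 Local timer interrupts"
after="LOC: 239072 651 12215 11323 Local timer interrupts"
echo "$before $after" | awk '{
    # fields 2-5: first snapshot; fields 10-13: second snapshot
    for (i = 2; i <= 5; i++) printf "CPU%d: %d\n", i - 2, $(i + 8) - $i
}'
```

For these inputs the second output line is `CPU1: 163`, the same delta computed by hand above.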
For comparison, I did the same experiment but changed the core where my overload process runs (I kept watching interrupts on CPU1):
taskset -c 0 : 8 interrupts
taskset -c 1 : 163 interrupts
taskset -c 2 : 7 interrupts
taskset -c 3 : 8 interrupts
One of my questions is: why are there not 0 interrupts? And why is the number of interrupts bigger when my process runs on CPU1? (I thought NO_HZ_FULL would prevent interrupts if my process was alone: "The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task" — https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt.)
Maybe an explanation is that there are other processes running on CPU1. I checked using the ps command:
CLS CPUID RTPRIO PRI NI CMD PID
TS 1 - 19 0 [cpuhp/1] 18
FF 1 99 139 - [migration/1] 20
TS 1 - 19 0 [rcuc/1] 21
FF 1 1 41 - [ktimersoftd/1] 22
TS 1 - 19 0 [ksoftirqd/1] 23
TS 1 - 19 0 [kworker/1:0] 24
TS 1 - 39 -20 [kworker/1:0H] 25
FF 1 1 41 - [posixcputmr/1] 28
TS 1 - 19 0 [kworker/1:1] 247
TS 1 - 39 -20 [kworker/1:1H] 501
As you can see, there are threads on CPU1. Is it possible to disable these processes? I guess it is, because if not, NO_HZ_FULL would never work, right?
Tasks with class TS don't disturb me, because they don't have priority over SCHED_FIFO tasks and I can set that policy for my program. The same goes for tasks with class FF and priority less than 99.
However, you can see migration/1, which is SCHED_FIFO with priority 99. Maybe these processes can cause interrupts when they run. That would explain the few interrupts when my process is on CPU0, CPU2 and CPU3 (8, 7 and 8 interrupts respectively), but it also means these processes are not running very often, and so it doesn't explain why there are many interrupts when my process runs on CPU1 (163 interrupts).
I also ran the same experiment, but with my overload process under SCHED_FIFO, and I got:
taskset -c 0 : 1
taskset -c 1 : 4063
taskset -c 2 : 1
taskset -c 3 : 0
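For reference, one SCHED_FIFO run can be done with chrt; a sketch of a single measurement (priority 98 is an assumption, chosen to stay below migration/1's priority 99):

```shell
grep LOC /proc/interrupts                    # snapshot before
sudo chrt --fifo 98 taskset -c 1 ./overload  # run under SCHED_FIFO on CPU1
grep LOC /proc/interrupts                    # snapshot after
```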
In this configuration there are more interrupts when my process uses the SCHED_FIFO policy on CPU1, and fewer on the other CPUs. Do you know why?
The thing is that a full-tickless CPU (a.k.a. adaptive-ticks, configured with nohz_full=) still receives some ticks.
Most notably, the scheduler requires a timer on an isolated full-tickless CPU for updating some state every second or so.
This is a documented limitation (as of 2019):
Some process-handling operations still require the occasional scheduling-clock tick. These operations include calculating CPU load, maintaining sched average, computing CFS entity vruntime, computing avenrun, and carrying out load balancing. They are currently accommodated by scheduling-clock tick every second or so. On-going work will eliminate the need even for these infrequent scheduling-clock ticks.
(source: Documentation/timers/NO_HZ.txt, cf. the LWN article (Nearly) full tickless operation in 3.10 from 2013 for some background)
A more accurate method to measure the local timer interrupts (the LOC row in /proc/interrupts) is to use perf. For example:
$ perf stat -a -A -e irq_vectors:local_timer_entry ./my_binary
Where my_binary has threads pinned to the isolated CPUs that utilize the CPU non-stop without invoking syscalls, for, say, 2 minutes.
There are other sources of additional local timer ticks (when there is just 1 runnable task).
For example, the collection of VM stats: by default they are collected every second. Thus, I can decrease my LOC interrupts by setting a higher value, e.g.:
# sysctl vm.stat_interval=60
Another source is the periodic check whether the TSCs on the different CPUs drift apart - you can disable it with the following kernel option:
tsc=reliable
(Only apply this option if you really know that your TSCs don't drift.)
You might find other sources by recording traces with ftrace (while your test binary is running).
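A minimal ftrace sketch for that (assumes root and tracefs mounted at /sys/kernel/debug/tracing; hrtimer_expire_entry is a standard event in the timer event group):

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on && echo > trace                 # stop and clear the buffer
echo 1 > events/timer/hrtimer_expire_entry/enable   # log every hrtimer expiry
echo 1 > tracing_on
taskset -c 1 ./overload                             # run the test workload
echo 0 > tracing_on
awk '/\[001\]/' trace | head                        # expiries handled on CPU1
```

The callback names in the trace output tell you which subsystem armed each timer.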
Since it came up in the comments: Yes, the SMI is fully transparent to the kernel. It doesn't show up as NMI. You can only detect an SMI indirectly.