I'm looking for ways to learn which syscalls or which subsystems a process or thread spends time waiting in, i.e. blocked and not scheduled to run on a CPU.
Specifically if I have some unknown process, or a process where all we know is "it's slow" I'd like to be able to learn things like:
sys_write()
on fd 13 which is /some/file"In other words when my program is not running on the CPU what is it doing?
This is astonishingly hard to answer with perf
because it does not appear to have any way to record the duration of a syscall from sys_enter to sys_exit or otherwise keep track of how long an event is. Presumably due to its sampling nature.
I'm aware of some experimental work with eBPF for Linux 4.6 and above that may help, with Brendan Gregg's off-cpu work. But in the sad world of operations and support a 4.6 kernel is a rare unicorn to be treasured.
What are the real world options?
Do ftrace, systemtap etc offer any insights here?
A system call is a request from computer software to an operating system's kernel. The Application Program Interface (API) connects the operating system's functions to user programs. It acts as a link between the operating system and a process, allowing user-level programs to request operating system services.
When responding to system calls, other traps/exceptions, and interrupts, OS code is run. The CPU automatically switches to monitor mode whenever an interrupt or trap occurs. So, the OS code is run in monitor mode. Input/output protection: Input/output is protected by making all input/output instructions privileged.
A system call is a routine that allows a user application to request actions that require special privileges. Adding system calls is one of several ways to extend the functions provided by the kernel.
strace is a diagnostic tool in Linux. It intercepts and records any syscalls made by a command. Additionally, it also records any Linux signal sent to the process. We can then use this information to debug or diagnose a program. It's especially useful if the source code of the command is not readily available.
You can use strace. First, you might want to get a high-level summary of the costs of each type of system call. You can obtain this summary by running strace -c
. For example, one possible output is the following:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
90.07 0.000263 26 10 getdents
3.42 0.000010 0 1572 read
3.42 0.000010 0 762 1 stat
3.08 0.000009 0 1574 6 open
0.00 0.000000 0 11 write
0.00 0.000000 0 1569 close
0.00 0.000000 0 48 fstat
The % time
value is with respect to overall kernel time, not overall execution time (kernel+user). This summary tells you what the most expensive system calls are. However, if you need to determine which specific instances of system calls are most expensive and what arguments are passed to them, you can run strace -i -T
. The -i
option shows the instruction addresses of the instructions that performed the system call and the -T
option the time spent in the system call. An output might look like this:
[00007f97f1b37367] open("myfile", O_RDONLY|O_CLOEXEC) = 3 <0.000020>
[00007f97f1b372f4] fstat(3, {st_mode=S_IFREG|0644, st_size=159776, ...}) = 0 <0.000018>
[00007f97f1b374ba] mmap(NULL, 159776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f97f1d19000 <0.000019>
[00007f97f1b37467] close(3) = 0 <0.000018>
The first column shows instruction addresses, the second column shows systems calls with their arguments, the third column shows the returned value, and the last column shows the time spent in that system call. This list is ordered by the dynamic occurrence of the system call. You can filter this output using either grep
or the -e
option. The instruction addresses can help you locate where in the source code these system calls are made. For example, if a long sequence of system calls have the same address, then there is a good chance that you have a loop somewhere in the code that contains the system call. If your executable binary is not PIE, the dynamic addresses are the same as the static addresses shown by objdump
. But even with PIE, the relative order of the dynamic addresses is the same. I don't know if there is an easy way to map these system calls to source code lines.
If you want to find out things like "it spends 80% of its time in sys_write() on fd 13 which is /some/file" then you need to write a script that first extracts the return values of all open
calls and the corresponding file name arguments and then sum up the times of all sys_write
calls whose fd
argument is equal to some value.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With